This letter is the Annual Progress Report for our research program supported under
DARPA-ONR  Contract  N00014-82-K-0727. 

During  the  period  of  1  July  1985  to  30  June  1986,  we  have  continued  to  make 
progress  on  the  acquisition  of  acoustic-phonetic  and  lexical  knowledge.  Specifically: 

• We have concluded our studies of lexical stress and improved the performance
of the lexical stress recognition system. The system is composed of two parts:
a syllable detector and a stress determiner. A number of modifications were
made to the syllable detector, including the introduction of more robust
intervocalic consonant detectors, new algorithms for sonorant detection, and
improvements in code to shorten run times and increase user flexibility for
system development. The system now runs approximately three times faster,
detects sonorants more accurately, makes fewer false insertions, and is more
flexible.

• We have conducted experiments to quantify the influence of phonetic context,
including syllable structure, on the acoustic properties of stop consonants.
Our results indicate that both syllable structure and phonemic context play
a significant role in determining whether a stop will be released, unreleased,
or deleted altogether. By continuing to study such contextual variations and
their acoustic consequences, we hope to eventually implement a computational
framework that incorporates context knowledge in phonemic decoding.

• We have undertaken an investigation to capture the knowledge that humans
use to read spectrograms, and to apply this knowledge to the creation of an
expert system. Humans are able to read spectrograms by extracting and then
integrating the relevant acoustic features, using rules that relate the
underlying phonetic forms to their acoustic manifestations. To test the
feasibility of developing a computer system that mimics such a process, we
selected a task of identifying stop consonants drawn from continuous speech.
Our preliminary results indicate that machine performance comparable to that
of human experts can be attained.

• We have begun development of a system that applies vision techniques to
extract acoustic patterns in speech spectrograms. By processing a
spectrographic image through a set of edge detectors and combining their outputs,
the system obtains two-dimensional objects that characterize the formant
patterns and general spectral properties of vowels and consonants. Preliminary
evidence suggests that the visual characterizations produced by this
processing technique may provide an effective alternative to traditional
descriptions of acoustic-phonetic events.

• We have initiated development of an articulatory synthesizer, LAMINAR,
capable of synthesizing speech from different vocal tract configurations. This
new speech research tool takes an articulatory configuration in the form of an
acoustic tube, and generates the resulting acoustic output. With continued
development, the system could realistically model many time-varying
articulatory gestures, thus providing a useful mechanism for speech production
experiments.

We  are  including  with  this  report  copies  of  the  following  publications,  in  the  form 
of  theses  and  papers  presented  at  various  conferences,  written  with  ONR  support 
during  this  contracting  period: 

•  Chen,  F.  R.,  “Lexical  Access  and  Verification  in  a  Broad  Phonetic  Approach 
to  Continuous  Digit  Recognition.” 

•  Huttenlocher,  D.  P.,  “A  Broad  Phonetic  Classifier.” 

• Leung, H. C., and V. W. Zue, “Visual Characterization of Speech
Spectrograms.”

• Unverferth, J. E., “Improvements to and Extensions of an Automatic Lexical
Stress Determiner.”

• Zue, V. W., “Utilizing Speech-Specific Knowledge in Automatic Speech
Recognition.”

• Zue, V. W., and L. F. Lamel, “An Expert Spectrogram Reader: A
Knowledge-Based Approach to Speech Recognition.”

Abstract

As part of her Master’s Thesis, Aull constructed a Lexical Stress Determiner for
discrete words. Her system was designed to determine the number of syllables and the
lexical stress pattern in discrete words. The purpose of this thesis is to make her system
more robust, both from a programmer’s point of view and from a performance
and reliability perspective.

Chapter 1


Introduction 


As part of her Master’s Thesis, Aull constructed a Lexical Stress Determiner for
discrete words[1]. As my thesis, I propose to make her system more robust, both from
a programmer’s point of view and from a performance and reliability perspective.

Aull developed this system in the course of studying the effect of lexical stress
information in large vocabulary speech recognition. Her system achieved 87% accuracy
in determining the correct number of syllables and the proper stress pattern. Her
system was written on a Symbolics Lisp Machine and was designed to interact with the
Spire[16] speech tool developed at the MIT Speech Group. The system was automated
such that you could speak an isolated word to it and it would soon return the stress
pattern. Because Aull concluded her work two years ago, extensive updating of her
code was needed. The efficiency of the code could also be improved to speed real-time
performance. Much of it had to be rewritten in order to run properly on the current
Lisp Machine operating system.

Her system consisted of two main components, a syllable detector and a stress
determiner. The syllable detector had problems finding boundaries in two cases: 1)
when two syllabic nuclei are separated by a sonorant consonant, as in “zero”, and 2) when
there are no intervening consonants, as in “react”. Aull’s syllable detector was fairly
accurate but relatively inflexible. It was not very good at finding syllable boundaries
that occurred at Vowel-Voiced-consonant-Vowel transitions. It also had problems with
short releases after stop consonants. The stress determiner gave only one answer with
no indication of a confidence level. This is a handicap when the system is used as a
front end of a large vocabulary lexical access system. If a mistake is made in stress
determination, then there is no way to find the correct target group of words. Mistakes
can include false insertions of syllables, deletions of syllables and incorrect labeling
of stressed syllables. Because the stressed syllables provide “islands of reliability”
for acoustic information within the word, it is especially important that the system
correctly identify them.

The  second  chapter  of  this  thesis  describes  lexical  stress.  It  explores  what  lexical 
stress  is  and  how  it  might  be  important  to  a  speech  recognition  system.  The  third 
chapter  describes  Aull’s  system  for  automatic  detection  of  lexical  stress  in  isolated 
words,  exploring  the  components  of  her  system  developed  by  others.  The  fourth  chapter 
explains  the  modifications  that  have  been  made  to  Aull’s  system  and  how  they  changed 
system  performance.  The  last  chapter  contains  conclusions  and  some  possible  directions 
for  future  development. 


Chapter  2 


The  Importance  of  Lexical  Stress 


2.1  What  is  Lexical  Stress? 

In  this  paper,  as  Aull  did,  I  will  be  dealing  exclusively  with  lexical  stress  in  isolated 
words.  This  isolates  a  stress  pattern  of  the  word  from  higher  order  effects  such  as 
intonation  and  sentential  stress. 

Historically, stress has been a poorly defined concept. Lexical stress can be
described from several points of view. It can be viewed linguistically, phonemically and
phonetically. It has been variously described as the force with which a syllable is said
or as a feature composed of other features (i.e. fundamental frequency, duration and
intensity)[9]. However, it is generally agreed that what we perceive as stress is not a
feature of speech (or language) unto itself but is rather a combination of other, more
basic, features.

This chapter briefly describes what lexical stress is and then explains some of the
motivations for wanting to look at lexical stress and incorporating knowledge about it
into speech recognition systems.


2.1.1  Stress  in  Language 

Stress is a perceived parameter; it is easily detected by a human listener. Most
languages have measurable stress effects in their words. In languages like French,
Finnish or Polish, the stress is fixed on a certain syllable in the word (such
as the first syllable or the last one). These languages are said to have fixed stress[8].

Other languages, most notably English, have what is called free stress, meaning
that the stressed syllable can fall anywhere in the word; stress can also have higher
order knowledge incorporated. In these languages, it is the words themselves that have
stress patterns associated with them. Sometimes the same spelling can have two or more
meanings and different stress patterns to go with them (e.g. “PERmit” and “perMIT”).
This is especially common when the same word represents two meanings that are
different word types (as in the previous example, where permit is first a noun and
then a verb).

The difference between stressed and unstressed syllables also changes from language
to language[8]. French, for example, has very little difference, which means that all of
its syllables are fully articulated. In English, on the other hand, many syllables are not
fully articulated, resulting in shortened sonorant regions and schwas.

In  English  it  is  usually  true  that  a  word  will  have  a  given  stress  pattern  consistently. 
This  is  different  from  other  languages  where  there  is  either  fixed  stress  in  words  or  there 
is  not  enough  difference  in  the  stress  between  syllables  to  be  reliably  determined. 


2.1.2  Components  of  Stress 


Linguistically, stress is considered a parameter unto itself. The same cannot be
said from an acoustic point of view. There is no single determiner for stress, which
means that you cannot look at one parameter (energy or some similar measure) and
reliably determine the stress pattern of a word.

The  Four  Main  Correlates  of  Stress 

Through  many  studies,  it  has  been  determined  that  English  stress  is  primarily 
determined  by  four  parameters.  These  parameters  are  energy,  fundamental  frequency, 
duration  and  phonetic  quality  [9]. 

Energy refers to the measure of acoustic intensity of the syllable. Syllables said
with more force exert more pressure on the surrounding air, which shows that there is
more energy put into the articulation of these syllables. The absolute amount of energy
in each syllable is not as important as the energy ratios within the word’s syllables.
Ratios are more important than absolute values for all these parameters because there
is a great deal of variability in speech, not only between different speakers but also
between different words uttered by the same person[15].
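As a minimal sketch of why ratios are preferred over absolute values, the snippet below scales each per-syllable measurement by the largest value within the same word. The function name `relative_to_word_max` and the sample energies are illustrative assumptions, not values or code from Aull's system.

```python
def relative_to_word_max(values):
    """Scale per-syllable measurements so the largest one in the word is 1.0."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("expected at least one positive measurement")
    return [v / peak for v in values]


# Hypothetical energies for the same three-syllable word, said quietly and loudly.
quiet = [2.0, 8.0, 4.0]
loud = [20.0, 80.0, 40.0]

# The absolute values differ by a factor of ten, but the ratios are identical.
print(relative_to_word_max(quiet))  # [0.25, 1.0, 0.5]
print(relative_to_word_max(loud))   # [0.25, 1.0, 0.5]
```

Both utterances yield the same relative profile, so a comparison made within the word is unaffected by overall loudness.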

Fundamental frequency, perceived as pitch, is also a main component in the
determination of stress. A syllable with higher pitch than another syllable, with
all else being equal, will be heard as the stressed syllable. Many experiments have
shown that it is not necessarily the peaks or mean values of the fundamental frequency
that correspond to stress perception but rather the shape of the F0 contour that really
matters[9].

Duration is important for stress perception as well. In general, the longer the
duration (relative to other syllables) the more likely that the syllable is going to be
perceived as stressed.

Phonetic quality is a measure of how fully articulated the syllable was. Aull
measured this parameter when she labeled the qualifying syllables as reduced, that is, they
were short and had little energy when compared to other syllables.

How  the  Correlates  Come  Together 

Using  the  words  in  her  database,  Aull  found  that  no  single  parameter  was  a  very 
good  indicator  of  which  syllables  were  stressed.  For  example,  maximal  average  energy 
corresponded  to  the  stressed  syllable  84%  of  the  time  and  the  peak  of  the  fundamental 
frequency  corresponded  to  the  stressed  syllable  only  70%  of  the  time.  These  results 
were  in  good  agreement  with  previous  data. 

Fry[5]  found  that  both  duration  ratio  and  energy  ratio  were  important  cues  for  the 
judgment  of  stress.  He  further  found  that  the  duration  ratio  was  more  reliable  than 
the  energy  ratio.  Morton  and  Jassem[9]  found  that  changes  in  fundamental  frequency 
had  greater  effect  on  stress  perception  than  did  changes  in  either  energy  or  duration. 

2.2  Usefulness  of  Lexical  Stress  in  Speech  Recognition  Systems

The obvious question is that of the potential importance of lexical stress in speech
recognition systems. We want to know if there is any useful information contained
in the stress pattern. For this report, I am limiting my comments to isolated words.
When continuous speech is included, higher order stress patterns and rhythm effects
start influencing stress patterns.


2.2.1  Lexical  Stress  and  “Islands  of  Reliability”

Aull and Zue[2,15], among others, claimed that stressed syllables were reliable places
to look for acoustic information. That is, acoustic cues were much more robust in those
areas. They further note that spectrogram reading experiments and automatic
recognition systems tend to recognize phonemes around stressed syllables more accurately
than around unstressed syllables. This result seems to be true in humans as well. Cole
and Jakimak[3] found that it took subjects longer to recognize a mispronounced word
when the syllable was unstressed than when it was stressed.

2.2.2  Lexical  Access  and  Large  Databases 

After doing studies on a lexicon developed from the Merriam-Webster Pocket
Dictionary, Aull found that lexical stress was very useful in reducing the expected
number of word candidates in a recognition system. Studies by Huttenlocher and Zue[6]
indicate that determination of broad phonetic classes greatly reduces the number of
potential word candidates in an isolated word recognition system. Information about the
number of syllables and their stress pattern can augment the phonetic class knowledge to
further reduce the word candidates in a recognition system, giving the later (and more
detailed) processing of such a system fewer possibilities to investigate.
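The pruning effect described above can be illustrated with a small sketch that indexes a lexicon by stress pattern, so that a recognized pattern selects only the matching words. The toy lexicon, the `S`/`U` pattern encoding, and the function names are assumptions made for illustration; they are not taken from Aull's lexicon study.

```python
from collections import defaultdict

# Hypothetical toy lexicon: entry -> stress pattern, one symbol per syllable
# (S = stressed, U = unstressed). Illustrative entries only.
TOY_LEXICON = {
    "zero": "SU",
    "window": "SU",
    "react": "US",
    "permit (noun)": "SU",
    "permit (verb)": "US",
}


def index_by_stress(lexicon):
    """Group lexicon entries by their stress pattern."""
    index = defaultdict(list)
    for word, pattern in lexicon.items():
        index[pattern].append(word)
    return index


def candidates(index, pattern):
    """Return the entries whose syllable count and stress pattern both match."""
    return sorted(index.get(pattern, []))


index = index_by_stress(TOY_LEXICON)
print(candidates(index, "US"))  # ['permit (verb)', 'react']
```

Because the pattern string encodes both the number of syllables and the position of the stressed one, a single lookup applies both constraints before any detailed phonetic analysis is attempted.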

All  the  evidence  seems  to  indicate  that  knowledge  of  lexical  stress  would  be  quite 
desirable in an isolated word recognition system. The information about stressed
syllables points to regions that tend to be more acoustically reliable, improving recognition
in  those  regions.  The  stress  pattern,  once  determined,  also  provides  an  additional  con- 


Chapter  3 


Aull’s  Lexical  Stress  Determiner 

3.1  System  Overview 

Aull’s system was designed to determine the stress patterns of isolated words. Her
motivation was largely to determine if this would be an effective way to reduce the
search for target words in large vocabulary systems. Aull’s system was written on a
Symbolics Lisp Machine to be used in conjunction with a Floating Point Systems array
processor. The system had as an integral component Spire, a speech research tool
developed within the MIT Speech Group.

The input to the system was digitized speech with no additional processing, and the
output was a time-aligned stress pattern of the word. The time-aligned stress pattern
corresponded to the vowel of the syllable and any surrounding sonorant segments. The
system labeled the syllables as either “stressed”, “unstressed” or “reduced”. There
could be only one stressed syllable in any word. If two syllables were close in the stress
rankings, the system labeled a second choice for the stressed syllable.

The system was broken down into two main sections. The first section was the
syllable detector. This section looked for sonorant regions and also looked for intervocalic
consonants whose presence indicated a single sonorant region that could contain two
or more syllables. The second section was the stress determiner. It performed
computations on the different sonorant regions in order to determine their stress ranking.


3.2  The  Computing  Environment 

As mentioned before, this system was developed on Symbolics LM-2s that were
equipped with Floating Point Systems FPS-100 array processors. The system was built
around the Spire speech tool and also included portions of systems developed by
others in the Speech Group.

The  Computing  Environment 

The Lisp Machines provided a very flexible and convenient environment in which
to work. Both Spire and Aull’s system made extensive use of a Flavor¹ system which
is part of the Lisp Machine operating system. The machine’s large virtual memory
and networking capabilities allowed the system to work with a great deal of data.
The Lisp Machine also has excellent facilities for system development[16]. The Lisp
language itself provided an exceptionally flexible and easy-to-use programming
environment.

The Lisp Machine has extensive development facilities on which to develop an
interactive system. It has very versatile multiple window support and a high resolution
bit-mapped display. The speed with which it computes needed parameters also allows
convenient interactive research. Using a mouse speeds interaction with the computer
and is more “user-friendly” than using the keyboard exclusively.

¹Flavors are structures which are easy to manipulate and facilitate message passing.

The system (especially the pitch detector written by Seneff[12]) used the FPS-100
a great deal. The array processor gave a great increase in speed over what would have
been possible on the Lisp Machine alone.

The  FPS-100  is  set  up  in  a  master/slave  configuration  with  the  Lisp  Machine.  It  sits 
idle  until  the  Lisp  Machine  sends  it  something  to  do.  Chunks  of  data  are  assembled 
in the Lisp Machine and sent out to the array processor. The array processor then
performs  a  series  of  steps,  or  a  mini-program  (stored  there  by  the  Lisp  Machine)  on 
the  data  and  finally  sends  the  results  back  to  the  Lisp  Machine.  This  continues  until 
the  entire  waveform  (or  any  array)  has  been  passed  through  the  array  processor. 

The  Spire  Advantage 

Spire  is  an  interactive  speech  research  tool  developed  at  the  MIT  Speech  Group  by 
D.  Shipman,  D.  Scott  Cyphers  and  David  Kaufman.  It  has  been  evolving  for  several 
years  and  many  others  have  contributed  to  it. 

Spire was developed on Symbolics Lisp Machines, mostly for the reasons stated
above. It was intended to be a replacement for other speech tools that existed at the
time. Its original implementation by David Shipman was completed in 1982. Following
that, Cyphers and Kaufman completely rewrote Spire in order to make it more flexible,
improve the user interface, improve data management and increase its efficiency (both
in run-time and in memory usage)[16].

As described by Cyphers[4], Spire has a four tier display system. A layout, at the top
of the hierarchy, is a screen of data. It is composed of displays, which are like windows.
These displays hold any number of overlays, which are essentially drawing methods.
These overlays take on the name of their associated atts. The atts are computations
performed on the data and are displayed in the manner specified by the overlay.
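The four tiers described above can be pictured as nested containers. The following is only a rough sketch in Python, not Spire's actual Flavor-based Lisp implementation; the class names, fields, and the `style` attribute are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Att:
    """A named computation performed on the data."""
    name: str
    compute: Callable[[Sequence[float]], Sequence[float]]


@dataclass
class Overlay:
    """A drawing method for one att (the 'style' field is a hypothetical stand-in)."""
    att: Att
    style: str

    @property
    def name(self) -> str:
        # An overlay takes on the name of its associated att.
        return self.att.name


@dataclass
class Display:
    """A window-like region holding any number of overlays."""
    overlays: List[Overlay] = field(default_factory=list)


@dataclass
class Layout:
    """A screen of data, composed of displays."""
    displays: List[Display] = field(default_factory=list)

    def overlay_names(self) -> List[str]:
        return [o.name for d in self.displays for o in d.overlays]


magnitude = Att("magnitude", lambda samples: [abs(s) for s in samples])
layout = Layout([Display([Overlay(magnitude, "line")])])
print(layout.overlay_names())  # ['magnitude']
```

Keeping the computation (the att) separate from the drawing method (the overlay) is what lets the same computed data be shown in more than one way.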

Spire works with representations called utterances. An utterance is an event or
an instance of someone saying something. That definition, while not very rigorous, is
sufficient for my purposes. Attached to the utterance are instances of flavors called
attributes. It is the attributes which define the atts.

Spire allows users to easily define new computations and modify old ones. Spire’s
design allows easy interaction with previously computed data. The display system is
the same way; it is very flexible and easily extendable. It is these characteristics that
make Spire desirable as a speech research tool.

It  is  a  combination  of  the  Spire  program  and  the  Lisp  Machine  support  that  allows 
systems  to  be  easily  built.  Since  many  of  the  structures  and  methods  needed  in  a  large 
system are already present in Spire, it makes sense and saves work to incorporate it
into  any  system  in  development. 

3.3  Syllable  Detection 

As I mentioned before, the first section of the system incorporated a syllable
detector. Because all syllables must have a vowel at their root, this part of the system
attempts to find and separate all the vowel regions in a word. The syllable detector
itself has two distinct components. The first is Hong Leung’s broad classifier that
was developed as part of a system that automatically aligns phonetic transcriptions
with continuous speech[14]. The second section, developed by Aull, separated sonorant
regions into different syllables if it found any intervocalic consonants.


3.3.1  Leung’s  Broad  Classifier 


Leung’s broad classifier was the first stage of a system that provided a
time-alignment of a phonetic sequence to the speech waveform[7]. Aull used this classifier
for her system to obtain a broad segmentation of the speech signal.

The approach that was taken was to first find acoustically robust regions in the
waveform. From there, more detailed analyses could be made in appropriate regions
that would not necessarily be meaningful to make in other regions. This breaks down one
large problem into several smaller ones that are more easily approached[14].

The  data  takes  the  structure  of  a  binary  decision  tree.  A  series  of  classifiers  make 
decisions  about  whether  or  not  a  time-slice  of  speech  has  a  certain  characteristic.  The 
classifiers  are  all  structurally  the  same  but  differ  in  the  parameters  that  they  look  at 
and  where  they  clip  their  values.  The  speech  is  analyzed  every  5  msec. 

A representative classifier uses M parameters, which are chosen on the basis of prior
speech knowledge. The parameters are computed, then processed; they are smoothed, clipped
and then normalized. Now, for every 5 msec, we have an M-dimensional feature vector.
A decision is made in this M-dimensional feature space through a K-Means clustering
algorithm. In this manner Leung found that he could reliably divide the utterance into
six types of regions:

•  S  (Sonorant)  :  vowel-like,  this  would  be  a  syllable  core. 

•  O  (Obstruent)  :  exhibits  high  frequency  “noise”. 

• VO (Voiced Obstruent) : shares characteristics of both of the above.

The segments that Aull was most interested in were naturally the sonorants. This
is because they form the root of syllables and hence were the segments that she had
to find. Leung’s broad classifier was very good at determining boundaries between
every type of segment except for different voiced segments. To find harder boundaries
(Vowel-Vowel for example), Aull had to develop her own algorithms.
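The per-frame processing used by the broad classifier (smooth, clip, normalize, then decide in an M-dimensional feature space) can be sketched as follows. This is an illustration under stated assumptions only: the two parameters, the clipping limits, the moving-average smoother, and the two preset centroids are hypothetical, and the sketch shows just the nearest-centroid decision that a trained K-means model implies, not Leung's classifier or its training.

```python
def smooth(track, width=3):
    """Moving average over `width` frames, using shorter windows at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(track)):
        window = track[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def clip_and_normalize(track, lo, hi):
    """Clip each value to [lo, hi], then rescale to the range 0..1."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in track]


def nearest_centroid(vector, centroids):
    """Return the label of the centroid closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[label]))
    return min(centroids, key=squared_distance)


# Hypothetical parameter tracks, one value per 5 msec frame (M = 2 here).
low_band = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
high_band = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]

tracks = [clip_and_normalize(smooth(t), 0.0, 1.0) for t in (low_band, high_band)]
frames = list(zip(*tracks))  # one M-dimensional feature vector per frame

# Assumed, already-trained centroids for two of the region types.
centroids = {"S": (1.0, 0.0), "O": (0.0, 1.0)}
labels = [nearest_centroid(frame, centroids) for frame in frames]
print(labels)  # ['S', 'S', 'S', 'O', 'O', 'O']
```

Runs of identical labels would then be merged into regions; the boundary between the third and fourth frames corresponds to a sonorant-to-obstruent transition in this toy input.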

3.3.2  Aull’s  Intervocalic  Detectors 

After  the  initial  segmentation  by  Leung’s  system,  Aull  inserted  a  subsystem  that 
was  designed  to  find  intervocalic,  voiced  regions.  This  is  meant  to  include  both  voiced 
consonants  (like  the  “r”  in  “miracle”)  and  vowel-vowel  transitions  (like  the  “ie”  in 
“anxiety”).  These  phenomena  often  exhibit  themselves  through  formant  movements  or 
energy  dips,  but  not  always. 

All of these detectors made extensive use of spectral weighting windows. Specifically,
short-time spectra of the waveform were multiplied by a frequency weighting function
designed to bring out spectral characteristics that were expected in certain frequency
ranges. The results of the multiplication were then accumulated into a Center of
Gravity function. The center of gravity function is as follows[1]:

$$\text{Center of Gravity} = \sum_{f=F_1}^{F_2} W(f)\, S(f)$$

where

$W(f)$ = linear weighting window
$S(f)$ = spectrum value at $f$
$F_1, F_2$ = frequency range

Although Leung’s system performed rather adequately on intervocalic nasals, Aull
developed a subsystem just for that purpose. In this part of the system Aull did not
choose to use a spectral weighting scheme; rather, she looked for a robust drop in
energy in the frequency region of the first three formants (F1, F2, and F3). As a result
of Leung’s segmenter and Aull’s detector the system was quite robust in determining
nasal boundaries.
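The weighted accumulation used by the spectral weighting detectors, a sum of the window weight times the spectrum value over a frequency range, can be sketched as below. The rising shape of the linear window, the frequency grid, and the two sample spectra are assumptions chosen for illustration; the source says only that the window is linear.

```python
def linear_window(freqs_hz, f1_hz, f2_hz):
    """Weight rising linearly from 0 at f1 to 1 at f2, and 0 outside the range."""
    return [
        (f - f1_hz) / (f2_hz - f1_hz) if f1_hz <= f <= f2_hz else 0.0
        for f in freqs_hz
    ]


def center_of_gravity(spectrum, weights):
    """Accumulate W(f) * S(f) over all frequency bins."""
    return sum(w * s for w, s in zip(weights, spectrum))


freqs = [0, 500, 1000, 1500, 2000]           # hypothetical bin centers in Hz
weights = linear_window(freqs, 1000, 2000)   # [0.0, 0.0, 0.0, 0.5, 1.0]

energy_near_2k = [1.0, 1.0, 1.0, 1.0, 4.0]   # energy concentrated near 2000 Hz
energy_near_0 = [4.0, 1.0, 1.0, 1.0, 1.0]    # energy concentrated near 0 Hz

print(center_of_gravity(energy_near_2k, weights))  # 4.5
print(center_of_gravity(energy_near_0, weights))   # 1.5
```

A window shaped for a particular consonant class produces a large value when the spectrum has energy where the window expects it, which is what makes the measure usable as a detector.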

Leung’s  system  was  not  quite  as  good  at  detecting  semivowels  (/l/,  /w/).  These 
are  often  characterized  by  a  drop  in  energy  similar  to  the  nasals  except  that  only  F2 
and  F3  show  a  significant  drop.  The  drop  in  energy  is  more  gradual  than  in  the  case 
of  nasals. 

Even harder to detect were intervocalic semivowels /r/ and /y/. These are
characterized by a concentration of energy around 2 kHz. There are sometimes dips (at
least for /r/) in formant frequencies as well, but far from always. The shape that these
semivowels take in the frequency domain is very context dependent and is hence
difficult to detect. Leung’s system generally misses these completely. Aull used a
spectral weighting window that emphasized 2000 Hz and 300 Hz while deemphasizing
the frequencies around 1100 Hz. In this way she could label regions as r-like
or not r-like[1].

The hardest types of intervocalic activity to detect are the vowel-vowel transitions.
For this type of decision, Aull used spectral weighting windows that attempted to
emphasize these changes. She took advantage of speech knowledge to determine windows
that would emphasize transitions between different types of vowels. Even so, these
changes are not very robust and are difficult to detect under the best of circumstances.

Figure 3.2: Two examples of /r/’s in isolated words (note the differences)

The syllable detector was designed to identify the sonorant regions of speech for
further analysis. Leung’s broad classifier first separated the speech into acoustically
robust segments. After that, Aull applied a series of processes designed to further
separate the sonorant regions by determining if there were any intervening regions
between syllable cores.


3.4  Stress  Determination 

As I mentioned before, it has been found that fundamental frequency (pitch),
duration and spectral energy are good correlates of what we perceive as lexical stress. The
problem that Aull encountered, though, was that no single one of these parameters could
determine the stressed syllable correctly more than 87% of the time. As a result, she
determined that using all of these parameters (as well as one other, spectral change)
was more reliable than using any one of them in determining the relative stress of
syllables in isolated words.

3.4.1  Acoustic  Parameters 

One of the parameters that Aull looked at was duration. She used the sonorant
region found by the front-end as the basis for her duration measurement. The sonorant
boundaries were determined within 5 msec. Any more accuracy would have been
unnecessary due to the uncertainties involved with the determination of the boundaries.
From her own studies and those by others, she found that the final syllable or sonorant
region must have its length adjusted for an effect called prepausal lengthening, i.e. the
lengthening of the final syllable in an isolated word.

Aull then looked at the energy over two bands extending from 400 Hz to 5000 Hz
and from 1200 Hz to 3300 Hz. These energies were picked to cover the range of sonorant
regions and to deemphasize energy regions associated with consonants.
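A band energy of this kind can be sketched as a sum of spectral power over the bins that fall inside the band. Only the band edges (400 to 5000 Hz and 1200 to 3300 Hz) come from the text; the function name, the bin frequencies, and the power values are illustrative assumptions.

```python
def band_energy(freqs_hz, power, lo_hz, hi_hz):
    """Sum the spectral power of all bins whose frequency lies in [lo_hz, hi_hz]."""
    return sum(p for f, p in zip(freqs_hz, power) if lo_hz <= f <= hi_hz)


# Hypothetical short-time power spectrum for one frame.
freqs = [200, 800, 2000, 4000, 6000]
power = [5.0, 3.0, 2.0, 1.0, 4.0]

wide = band_energy(freqs, power, 400, 5000)    # 3.0 + 2.0 + 1.0
mid = band_energy(freqs, power, 1200, 3300)    # 2.0 only
print(wide, mid)  # 6.0 2.0
```

In this toy frame the strong components at 200 Hz and 6000 Hz fall outside both bands and so do not contribute to either energy.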

Fundamental frequency or pitch was the third parameter to be measured. The
pitch was determined by first enhancing the fundamental periodicity of the waveform
and then applying an Average Magnitude Difference Function to the enhanced
waveform [12,1]. Aull also mentioned that the peak value of the pitch seemed more
significant in determining stress than its average value because of differences between
isolated words and continuous speech.

Another  parameter  that  Aull  incorporated  was  spectral  change[14].  This  parameter 
was  a  measure  of  change  of  energy  in  sonorant  regions.  The  energy  change  was  measured 
across  several  energy  bands  according  to  the  following  formulas: 


This parameter was used because it was found that stressed syllables were more
acoustically stable than unstressed ones. This parameter was only extracted in the
central parts of the sonorant regions so that the surrounding regions could not influence
the spectral change measurement.


3.4.2  Stress  Determination  Algorithm 

Aull  had  to  combine  these  parameters  into  one  meaningful  measurement  of  stress. 
She  initially  tried  K-means  Clustering  techniques  but  found  that  they  did  not  perform 
adequately.  The  main  problem  with  any  system  that  looks  across  a  group  of  words  is 
that  there  is  too  much  variability  across  isolated  words.  She  then  dropped  this  and 
other  methods  that  required  accumulating  statistics  across  many  instances  of  isolated 
speech  and  instead  adopted  a  method  that  used  only  the  particular  word  that  the 
system  was  currently  working  on. 

She  associated  a  five-dimensional  feature  vector  with  each  sonorant  region.  Then, 
for  each  parameter,  the  system  determined  the  maximum  value  across  all  the  sonorant 
regions  and  collected  them  into  a  maximum  feature  vector.  This  maximum  feature 
vector  was  the  basis  to  which  the  sonorant  regions  in  the  word  were  compared.  This 
reduced  interword  variability. 

A  Euclidean  distance  from  the  maximal  feature  vector  to  each  sonorant  feature 
vector  was  determined.  The  region  with  the  shortest  distance  was  considered  to  be  the 
stressed  syllable.  The  other  syllables  in  the  word  were  all  labeled  unstressed.  Further 
processing  determined  which  sonorant  regions  were  reduced  by  looking  at  their  energy 
and  duration. 
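
The comparison against the maximum feature vector can be sketched as follows. This is a minimal illustration rather than Aull's code: the function name `rank_syllables` is hypothetical, and scaling each parameter by its within-word maximum (so that parameters measured in different units contribute comparably) is an assumption, since the text does not say how the parameters were normalized.

```python
import numpy as np


def rank_syllables(features):
    """Rank the sonorant regions of one word by closeness to the maximum vector.

    features: array of shape (n_regions, n_params), one row per sonorant
    region, with all values assumed positive.
    Returns (distances, ranks); rank 1 marks the region taken as stressed.
    """
    features = np.asarray(features, dtype=float)
    max_vector = features.max(axis=0)        # per-parameter maximum in this word
    scaled = features / max_vector           # assumption: scale by the word maximum
    # Euclidean distance from each region to the (scaled) maximum vector.
    distances = np.linalg.norm(scaled - 1.0, axis=1)
    order = np.argsort(distances)            # closest region first
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return distances, ranks


# Usage with three regions and three made-up parameters
# (duration in msec, energy in dB, peak pitch in Hz).
word = [[120.0, 60.0, 180.0],
        [200.0, 70.0, 220.0],
        [90.0, 50.0, 150.0]]
distances, ranks = rank_syllables(word)
```

Because the reference vector is built from the word itself, no statistics from other words are needed, which is the property the text credits with reducing interword variability.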

Figure 3.3: An Example of Stress Determiner Output. The numbers correspond to
ranking and the letters mean Stressed, Unstressed or Reduced.


3.5  System  Performance 



Aull tested her system on a 1600-word database. Her system correctly determined
the stress pattern 87% of the time. Of the remaining 13% error, 3% was due to confusion
between unstressed and stressed syllables within the word. The other 10% corresponded
to either missing a sonorant region, failing to insert a boundary in the case of
intervocalic phenomena, or false insertion of a region or boundary.

Aull  determined  that  the  acoustic  correlates  of  lexical  stress,  as  determined  by  Fry  [5] 
and  others,  were  quite  adequate  for  determining  the  stress  in  a  word.  She  did  find, 
much  as  she  expected,  that  her  system  performance  degraded  as  acoustic  cues  became 
more  subtle. 

3.6  Summary  of  Aull’s  System 

Aull’s  system  consisted  of  two  main  subsystems,  a  syllable  detector  and  a  stress 
determiner.  The  syllable  detector  was  made  up  of  Leung’s  acoustic  front  end  and  Aull’s 
intervocalic  detectors.  The  stress  determiner  extracted  a  five-dimensional  feature  vector 
from  the  sonorant  regions.  These  parameters  have  been  experimentally  determined  to 
influence  perception  of  stress.  The  feature  vectors  were  then  compared  to  a  maximal 
vector  for  stress  determination. 


Chapter  4 

Modifications  of  Aull’s  System 

4.1  System  Flaws 

Aull’s  system  was  very  good  but,  like  all  systems  of  its  type,  was  not  perfect.  Aull 
measured  the  system  at  87%  accuracy.  That  figure  refers  to  correct  determination  of 
the  syllables  and  the  stress  pattern.  She  found  that  3%  of  the  time,  the  stress  pattern 
was  not  determined  correctly  by  the  system.  This  means  that  10%  of  the  time,  there 
was  a  problem  in  finding  the  syllables  correctly.  These  errors  correspond  both  to  false 
insertions  and  false  deletions. 

Thus  the  largest  problem  that  the  system  had  was  in  the  area  of  syllable  detection. 
This  part  was  difficult  because  it  relies  on  acoustic  cues,  some  of  which  can  be  quite 
ambiguous.  The  stress  determiner,  while  not  perfect,  is  more  robust  than  the  syllable 
locator  because  the  parameters  used  to  determine  stress  have  been  heavily  studied  and 
are  fairly  well  understood.  While  this  section  also  relies  on  acoustic  parameters,  it  is 
constrained  to  the  boundaries  determined  by  the  syllable  detector. 

One of the biggest problems was that Aull’s system had been written almost two years
earlier. Much of what she had done was unexplained. The computers that the group
currently works with are different from the ones that Aull worked on (though they
are still Symbolics machines and retain a great deal of compatibility). Also, both
the Lisp Machine operating system and Spire had undergone several major changes.
As a result, much of the existing code had to be changed
in order to get the system running as before.

4.1.1  Problems  with  Syllable  Detection 

Aull’s system, as I stated before, was evaluated at about a 10% error rate for the
syllable detection section of the system. The system performed quite well in identifying
syllables that are separated by obstruents (as in “duplicate”). These boundaries were
correctly determined by the acoustic front end and required little additional processing.

The syllable finder’s performance decreased as the consonantal regions between
vowel regions became less obstruent-like. This drop in performance is due to the
fact that some intervocalic voiced consonants appear more vowel-like than others. As
mentioned before, /r/’s and /l/’s are always hard to find, because sometimes they take on
vowel-like acoustic properties.

The system also had trouble with vowels whose amplitudes are low. This phenomenon
occurs in reduced vowels. Some people reduce them more than others and
sometimes the reduction results in a deletion of the region. Even when there is only a
very short, low energy sonorant, a human listener will still detect a syllable there. The
solution is to detect these regions and then eliminate any false alarms resulting
from making the system more sensitive.

Finally, the system has a great deal of difficulty with syllables that are not separated
by consonants, because the acoustic cues for vowel-vowel transitions may
be subtle and are not well understood.

4.1.2  Problems  with  Stress  Determination 

While this component proved to be more reliable than the syllable locator, it still
had several problems. The largest problem was that it was not flexible enough for lexical
lookup into a large lexicon. The system provided a stress pattern and no additional
information. This means that the system will be either right or wrong, with no
margin of error: there is no second choice and no indication of the quality of the
decision. An improvement in this part of the system would allow more flexibility and
would go a long way to remedying the problem of misidentifying a word’s stress pattern.


4.2  System  Code  Changes 


Almost two years elapsed between when Aull finished her research and when I
started to look into her system. Unfortunately the system and machine that her
stress determiner ran on did not remain static through that time. The Speech Group
updated its machines to the newer Symbolics 3600 Series Lisp Machines. Symbolics also
introduced numerous changes in its operating system and, most significantly, Spire
was extensively rewritten by D. S. Cyphers and David Kaufman. All these changes
contributed to the work that had to be done in order to return the system to its former
status and hopefully beyond.


4.2.1  Updating  System  Code 


The  first  thing  to  be  done  was  to  rewrite  the  code  so  that  it  would  run  again. 
Getting  a  system  that  would  run  and  one  that  would  run  correctly  turned  out  to  be 
two  different  things.  To  get  the  system  running  mostly  entailed  recompiling  the  system 
and  making  some  simple  changes  that  included  changing  message  names  and  other 
system  updates. 

Much of the system’s Spire interface had to be rewritten in order to run properly.
Since Aull had finished her work, Spire had changed a great deal. Both Spire displays
and the representation of time-aligned data had changed incompatibly. In both
cases (operating system and Spire) there were also subtle changes that affected system
performance. These had to be corrected individually as they were found.

4.2.2  Improving  System  Efficiency 

Improving system efficiency and run-time performance was a different issue from
updating the code. After the code had been updated, it was found that there were many
places that would benefit from being rewritten or modified. Some of the modifications
were for the sake of computational efficiency and others were done in order to make the
code more compact and smoothly flowing. The biggest changes that were made had to
do with the way in which segments and their boundaries were accessed.

The major running-time improvement was contributed by Seneff, who wrote a version
of the Gold-Rabiner pitch detection algorithm [11]. This algorithm was much faster than
the algorithm then being used. The system changes introduced by this
author also appear to have improved run-time performance, but this is difficult
to substantiate. The speed of the system was further improved by numerous hardware
modifications to the Symbolics machines. The system now runs at least three times
faster than it did before, or about 95 times real time.


4.2.3  Improvements  in  System  Flexibility 

The system as Aull left it was rather rigid in that the parameters that went into
computations could not be modified or accessed easily. Modification of computational
parameters is a task that Spire makes easy. These parameters were therefore made
changeable from Spire so that it was not necessary to constantly recompile
the system code when changing numbers or parameters. This change facilitated the
development stage, when thresholds were specified iteratively in order to minimize both
false insertions and deletions of segments.


4.3  Changes  in  Syllable  Detection 

The  syllable  detection  section  was  broken  into  two  different  parts  for  the  purposes 
of  modification.  They  followed  the  natural  division  of  this  section,  that  is  Leung’s  front 
end  and  Aull’s  detectors  for  intervocalic  events. 

4.3.1  Improving  Sonorant  Detection 

The first goal was to improve the sonorant detection in the acoustic front end. This
is an important step because if a sonorant region was not detected there, it would be
unavailable for all subsequent processing. However, erroneously inserted segments
arising from making the system more sensitive to sonorant regions could be eradicated
in later system components.

The front end was missing sonorant regions, but was also falsely breaking up valid
regions. That is, it would first label a region sonorant and then insert another label
in the middle of it. This had the result of creating two invalid sonorant regions from one
good one.

The solution to the missing sonorant problem was found in the K-Means
clustering algorithm used in the system. The initial step in that algorithm was to
establish clipping values for each of the parameters investigated. The clipping values were
the extremes in the data-space within which a given type of region was expected to fall.
These clipping values were reevaluated iteratively so that the number of bad regions was
minimized while the number of low amplitude sonorants detected was maximized. The effect
was that many low energy sonorants were found, while few false insertions resulted.

In  order  to  decrease  the  number  of  false  insertions  into  the  middle  of  sonorant 
regions,  some  of  the  processing  done  in  the  front  end  had  to  be  eliminated.  The  system 
would  initially  find  and  label  sonorant,  obstruent  and  silent  regions.  It  would  then 
segment  the  regions  further  by  looking  for  different  acoustic  cues  within  these  regions. 
It  is  in  this  later  processing  that  the  errors  (the  false  insertions  into  the  sonorant 
regions)  usually  occurred.  The  key  to  solving  this  problem  was  to  determine  at  what 
point  in  the  processing  the  most  errors  were  inserted  while  not  missing  too  many  valid 
regions.  It  was  determined  that  some  processing  after  the  initial  labeling  was  necessary 
in  order  to  keep  the  number  of  false  insertions  down  to  a  minimum. 

After changing the front end, more problems had to be dealt with. First, many of
the new sonorant segments were discarded by a module that tried to decide what was
really a sonorant and what wasn’t. This part of the system looked at the duration,
energy and spectral change of the sonorant region, and if the region was too short
and/or had too little energy, it was deleted from the syllable list.
The thresholds at which the system would cut off sonorant candidates were changed
iteratively. Each time the levels were changed, it had to be ensured that the falsely labeled
regions were kept to a minimum while the truly sonorant regions detected were maximized.
In the end, all the average energy thresholds needed to qualify a segment as
sonorant were lowered. At the same time it was determined that the average length
of the falsely detected segments was less than that of even the most reduced real sonorant
regions. Because of this, I lowered the durational threshold as well. This allowed the low
energy sonorants to be detected while still keeping the false indications to a minimum.
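
The effect of lowering the thresholds can be sketched with a simple filter. Everything here is hypothetical: the `Region` fields, the function name `keep_sonorants` and all numeric values are made up for illustration, and the spectral change criterion is left out.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A candidate sonorant region (fields are illustrative assumptions)."""
    start_ms: float
    end_ms: float
    mean_energy_db: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def keep_sonorants(regions, min_duration_ms, min_energy_db):
    """Drop any region that is too short and/or has too little energy."""
    return [
        r for r in regions
        if r.duration_ms >= min_duration_ms and r.mean_energy_db >= min_energy_db
    ]


# Usage: a full vowel, a reduced vowel, and a very short spurious blip.
candidates = [
    Region(0.0, 120.0, 62.0),
    Region(150.0, 185.0, 48.0),
    Region(200.0, 210.0, 45.0),
]
strict = keep_sonorants(candidates, min_duration_ms=50.0, min_energy_db=55.0)
relaxed = keep_sonorants(candidates, min_duration_ms=30.0, min_energy_db=45.0)
```

With the stricter (made-up) thresholds only the full vowel survives; with the relaxed ones the reduced vowel is also kept, while the short blip is still rejected on duration.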

Spectral  change  is  used  as  a  parameter  in  this  computation  because  Aull  felt  that  if  a 
region  exhibited  a  great  deal  of  spectral  change  then  it  was  less  likely  to  be  a  sonorant 
than  a  more  spectrally  static  segment.  This  conclusion  is  not  exactly  obvious  for 
segments  of  such  short  duration,  but  the  inclusion  (or  modification)  of  this  parameter 
has  not  caused  any  system  deterioration.  Because  the  spectral  stability  gives  another 
clue  to  the  segment’s  identity,  impostor  sonorants  of  greater  duration  can  be  more 
reliably  removed  from  consideration. 

4.3.2  Improving  Detection  of  Intervocalic  Consonants 

This is the section that proved to be the most disappointing in terms of improving
system performance. While changing some thresholds seemed to yield moderate
performance gains in /l/ detection, the system had greater difficulty with /r/’s.

The problem was that if the system were made more sensitive to the spectral movement
that often occurs with intervocalic /r/’s (as in “interrupt”), it would then get
more false boundary insertions at /r/’s that were not intervocalic (as in “cohort”).
This trade-off between insertions and deletions was a problem that had always plagued
the system. A different parameter might have been better, but time did not
permit an investigation of this possibility. One parameter, a spectral first difference,
was investigated but preliminary results indicated that it was not useful for intervocalic
/r/ detection.

Another difficult problem was that of vowel-vowel transitions. Detecting these
transitions reliably is difficult because of the many different, often subtle changes they
produce in the spectrum of the word. Sometimes the changes can be very obvious while at
other times they manifest themselves only through slow formant changes. The difficulty
in finding and interpreting them is compounded by their variability from speaker to
speaker.

The  previous  two  problems  received  much  attention,  mostly  in  the  form  of  changing 
parameters  and  thresholds,  both  in  the  acoustic  front  end  and  in  Aull’s  intervocalic 
detectors.  Unfortunately  both  met  with  little  success. 

4.4  Changes  in  Stress  Detection 

The biggest problem with the stress determination mechanism was that it was not
very flexible. In addition, it did not always find the stressed syllable correctly.
A large part of this second problem can be attributed to the variability
with which sonorant regions surrounding the vowel are included in the segment. All
other things being equal, the region that is longer will be considered stressed by the
system. This could be a problem when two regions are similar in the amount of stress
that can be attributed to them and one region is significantly longer than the other.
Different weighting functions for the parameters in the distance metric were tried but
this provided no measurable change in the stress determination.

To  improve  the  flexibility  of  the  system,  new  methods  to  provide  data  for  further 
processing  of  the  stress  information  were  considered.  Time  did  not  permit  proper 
investigation  of  the  usefulness  of  the  results  of  these  methods,  but  it  is  felt  that  they 
could  contribute  to  overall  system  performance.  Also,  it  was  felt  that  this  section  was 
not  as  critical  as  others  because  this  aspect  of  the  system  performed  relatively  well. 

A way to obtain the actual measurements of the system (rather than just “stressed”
or “unstressed”) was provided. This allows one to look at the results of the Euclidean
distance measurement across the M-dimensional feature vector. Another addition that
was made (and kept in the system because it both provided additional information and
was easily interpreted) was the inclusion of the ranking of the syllables rather than just
labeling them “stressed”, “unstressed” or “reduced”. This allows the user to see the
output of the system more clearly.

4.5  System  Evaluation 

The  system  was  tested  on  228  isolated  words  spoken  by  six  speakers  (3  male  and  3 
female).  These  words  were  taken  from  databases  used  by  Aull.  Her  system  returned 
errors  on  all  these  words  at  some  point.  Some  of  the  words  were  evaluated  correctly 
by  her  final  system  but  were  included  to  determine  if  changes  to  the  system  degraded 
performance  on  data  that  was  already  valid. 

In Table 4.1, the V-V, Cons. and Son. columns correspond to misses in the
vowel-vowel (as in “anxiety”), consonant (such as /r/ or /l/) and sonorant (as in
the last syllable of “action”) contexts. The Insert column refers to false insertions of
sonorant regions and the Bad Stress column refers to incorrect stress assignment. The
numbers are the total number of errors of each type resulting from evaluating the data
base. Errors were counted this way because some words resulted in more than one error
while others resulted in none.


Table 4.1: Evaluation Results

              V-V    Cons.    Son.    Insert    Bad Stress
Original       24       46      37        63             7
Modified       24       47      22        56             5


The number of missed vowel-vowel transitions did not change at all. This was
expected because nothing was done to the system that would directly affect performance
here. The important thing is that system performance did not degrade. Much the same can
be said for the missed intervocalic consonants. Although an effort was made to improve
performance in this area, it was unsuccessful.

The number of missed sonorant regions dropped significantly. This was due mostly
to the changes in the initial processing of the acoustic front end. The remaining
undetected sonorant regions were very short and had low energy but still could be perceived
as syllables by human listeners.

The number of incorrect insertions also dropped. There were two effects going on
in this case. The sonorant detector defined more regions as sonorant than it did before
because of its increased sensitivity. That increased the number of false insertions. On
the other hand, fewer valid sonorant regions were being broken up, driving the number
of false insertions down. This was the dominating effect, bringing the total number of
insertions down.
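
As a quick check on the size of these changes, the relative reduction in each error category can be computed from the counts in Table 4.1. The sketch below assumes the table columns are in the order V-V, Cons., Son., Insert, Bad Stress; the helper name `relative_reduction` is made up for the example.

```python
def relative_reduction(before, after):
    """Percentage by which an error count fell (negative if it rose)."""
    return 100.0 * (before - after) / before


# (original, modified) error counts, assuming the column order given above.
errors = {
    "V-V": (24, 24),
    "Cons.": (46, 47),
    "Son.": (37, 22),
    "Insert": (63, 56),
    "Bad Stress": (7, 5),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in errors.items()}
for name, pct in reductions.items():
    print(f"{name:<11}{pct:6.1f}%")
```

Under that reading, missed sonorant regions fall by roughly 40% and false insertions by roughly 11%.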


The number of words with incorrect stress assignment also dropped slightly. This
was because the sonorants had different boundaries than before, so the
measurements for those regions were different. This was the only effect taking place,
since no changes were made to the stress computation itself. It was found, however,
that in every case of bad stress assignment the stressed syllable was ranked
second, showing that the system was close.

In  the  course  of  investigating  these  results,  it  was  found  that  most  problems  could 
be  corrected  interactively.  This  indicates  that  system  performance  might  be  able  to 
benefit  from  some  sort  of  time  varying  evaluations  on  a  frame  by  frame  basis. 


4.6  Summary  of  System  Improvements 

The changes made to Aull’s system led to several improvements. These improvements
are:

•  Run Time Performance - Improvements in code efficiency and hardware changes
decreased running time threefold, to about 95 times real time.

•  System Flexibility - Through code changes, the system was made easier to use
and change interactively.

•  Syllable Detection - The system detected syllables more accurately through
changes that improved identification of sonorant regions. In addition, the
insertion of spurious regions into the middle of valid vowel regions was reduced. In
these two areas, the number of errors was reduced from 37 to 22 and from 63 to 56
(40% and 11%) respectively.

System performance did not degrade in any way as a result of these changes.

Chapter 5


Conclusions 


5.1  Summary 

In this thesis, lexical stress was described and its potential utility in automatic
speech recognition was outlined. Stress is a perceived quality of a syllable.
Acoustically, stress can be determined primarily through four parameters: spectral
energy, duration, fundamental frequency and spectral quality. The lexical stress pattern
of a word is useful to determine in an automatic recognition system because it reduces
the search space of possible candidates.

Next, the system that Aull developed for her Master’s Thesis was investigated. The
system was made up of two main parts: a syllable detector and a stress determiner.
The syllable detector was composed of an acoustic front end and a series of intervocalic
consonant detectors. The stress determiner took an M-dimensional feature vector of
each sonorant region and compared it to a maximum feature vector, and from that the
syllables were ranked.

Then the problems that existed in Aull’s system were explained. These included
poor performance on some intervocalic consonants, on vowel-vowel transitions and on
low amplitude sonorants.

Finally, this author’s changes to the system were described. These changes to Aull’s
system did improve its performance. The system ran approximately three times faster,
had improved sonorant detection, had fewer false insertions and was more flexible. The
number of missed vowel regions decreased by 40% and the number of false insertions
into sonorants decreased by 11%. Other areas were not improved, as indicated by
the evaluation data, but in no case did system performance deteriorate.

5.2  Suggestions  for  Future  Research 

There are many ways to further improve on the work done so far on this system.
Many of the parameter values used in the system (that have been changed by this
author) can still be improved on using statistical tools and knowledge of speech signals
and production. A different method for detection of intervocalic effects, also utilizing
more speech knowledge, could also be incorporated. More improvements in code
efficiency could be made, to be sure. An algorithm for assigning probabilistic values to
stress rankings would be quite useful for making the system suitable for incorporation
into an isolated word recognizer.

ABSTRACT

This  paper  describes  an  implementation  of  a  robust 
method  of  lexical  access  and  a  detailed  phonetic  verification 
component  for  recognizing  continuous  digits  using  a  broad 
phonetic approach. The lexical access component uses a
scoring method which takes into account soft labeling errors
due  to  input  signal  variability.  Verification  is  based  on 
the  use  of  a  small  set  of  detailed  acoustic  features  which 
characterize  phone  hypotheses.  Evaluation  of  the  lexical 
access  method  on  a  database  of  74  new  random  length 
digit  strings,  each  spoken  by  5  new  speakers,  shows  the 
method  to  be  tolerant  to  front-end  errors  and  variations 
in  pronunciation.  Evaluation  of  the  verification  component 
indicates  that  use  of  a  few  detailed  phonetic  features  is 
adequate  for  verification  of  phones  in  the  digit  vocabulary. 


INTRODUCTION 

In ICASSP-82, Shipman and Zue [1] showed that a
broad  phonetic  representation  imposes  strong  sequential 
constraints  on  words  in  the  English  language.  They  then 
proposed  an  isolated  word  recognition  model  which  uses  the 
constraints  provided  by  a  broad  phonetic  representation.  In 
their  model,  the  speech  signal  is  segmented  and  classified 
into  several  broad  categories  which  can  be  determined 
reliably.  Next,  indexing  into  the  lexicon,  only  words  which 
match  the  sequence  of  broad  phonetic  labels  remain  as 
contending word candidates. Finally, the contending words
are  examined  using  detailed  phonetic  analysis  to  identify 
the  input  utterance. 
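
The indexing step of this model can be illustrated with a small Python sketch. The phoneme symbols, the broad class inventory and the digit transcriptions below are simplified assumptions chosen for the example and are not taken from [1]; the function names are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical phoneme-to-broad-class map (illustrative only).
BROAD_CLASS = {
    "s": "SF", "z": "SF",                  # strong fricatives
    "f": "WF", "v": "WF", "th": "WF",      # weak fricatives
    "t": "ST", "k": "ST",                  # stops
    "n": "N",                              # nasals
    "r": "LG", "w": "LG",                  # liquids and glides
    "iy": "V", "ih": "V", "eh": "V", "ax": "V", "ah": "V",
    "uw": "V", "ow": "V", "ao": "V", "ay": "V", "ey": "V",
}

# Simplified, assumed transcriptions of the ten digits.
LEXICON = {
    "zero": ["z", "iy", "r", "ow"], "one": ["w", "ah", "n"],
    "two": ["t", "uw"], "three": ["th", "r", "iy"],
    "four": ["f", "ao", "r"], "five": ["f", "ay", "v"],
    "six": ["s", "ih", "k", "s"], "seven": ["s", "eh", "v", "ax", "n"],
    "eight": ["ey", "t"], "nine": ["n", "ay", "n"],
}


def broad_string(phonemes):
    """Map a phoneme sequence to its sequence of broad class labels."""
    return tuple(BROAD_CLASS[p] for p in phonemes)


def build_index(lexicon):
    """Group words by broad class sequence so lookup is a single dict access."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[broad_string(phonemes)].append(word)
    return index


def candidates(index, labels):
    """Return the words whose broad class sequence matches the labels exactly."""
    return index.get(tuple(labels), [])


index = build_index(LEXICON)
print(candidates(index, ["SF", "V", "ST", "SF"]))
```

Only exact matches are returned here, so any labeling error by the classifier removes the correct word from the candidate list unless the variant pronunciation is added to the lexicon explicitly.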

Chen and Zue [2] extended Shipman and Zue's isolated
word  recognition  model  to  continuous  speech  and  showed 
that  strong  lexical  constraints  at  the  broad  phonetic  level 
can  be  exploited  in  a  continuous  digit  recognition  task. 
To  illustrate  that  the  approach  is  viable,  a  broad  phonetic 
classifier  and  lexical  access  component  were  implemented. 
Testing  on  1718  digits  by  5  speakers,  the  correct  digit  was 
not  one  of  the  lexical  candidates  only  1%  of  the  time.  While 
the results were encouraging, this initial implementation
suffered in one important respect: the implementation did
not provide flexibility in accommodating similar but new
acoustic realizations of a word. Instead, new pronunciations
were accommodated by explicitly adding them to the lexicon.
In  other  words,  a  digit  was  considered  a  candidate  only 
if  the  input  string  was  a  pronunciation  supplied  by  the 
lexicon. 

In  the  current  study,  two  aspects  of  the  broad  phonetic 
recognition  model  were  focused  on.  First,  Chen  and  Zue’s 
work  was  extended  in  an  effort  to  develop  a  more  robust 
method  of  lexical  access  which  could  tolerate  reasonable 
“errors”  by  the  broad  phonetic  classifier.  Second,  a 
preliminary  examination  of  verification  of  word  hypotheses 
based  on  detailed  phonetic  features  was  performed. 

LEXICAL  ACCESS 

Researchers (e.g. [3] and [4]) have developed systems
which  perform  lexical  access  and  recognition  directly  from 
a  phonemic  sequence.  In  contrast,  this  study  is  based 
on  the  belief  that  a  more  robust  recognition  method  is 
to  perform  lexical  access  by  scoring  how  well  the  broad 
phonetic  representation  of  an  unknown  utterance  matches 
the  phonetic  representation  of  a  word  in  the  lexicon. 
Since  less  detailed  distinctions  are  needed  to  produce  a 
broad  phonetic  representation  than  a  detailed  phonetic 
representation,  one  should  be  able  to  compute  a  broad 
phonetic  representation  with  less  error. 

In the broad phonetic recognition model (Figure 1),
the broad phonetic classifier produces a broad class
segmentation string of the incoming signal. The
segmentation string may be composed of six possible labels:
weak fricative, strong fricative, short voiced obstruent,
vowel, sonorant, and silence. The lexical component
matches the phonetic representation of each word in the
lexicon against the broad class segmentation produced by
the system, yielding a lattice of word candidates.

Although a broad phonetic representation is more
robust than a detailed representation, unanticipated
acoustic realizations do occur, resulting in classification
errors at the broad phonetic level. For example, the closure
in a stop gap may be incomplete, resulting in a “noisy”
stop gap which is labeled as a “weak fricative”. A lexical
access component was implemented which attempts to
handle these labeling errors using two types of knowledge:
1) how often a phoneme is mislabeled as another class,
for example, how often a /k/ closure is labeled as
“weak-fricative” instead of “silence”; and 2) how often
an insertion or deletion occurs in mapping a word’s
broad class representation to a phonetic representation,
for example, the frequency with which /s/ and /n/ in
the sequence /sn/ (as in “six nine”) are both labeled as
“strong-fricative,” because the initial nasal
in that context may be deleted or extremely short. By
using these types of knowledge about the characteristics
of the segmentation strings produced by the front-end,
the lexical component allows for acoustic variations in
a phone. Furthermore, many alternate broad phonetic
representations of a word needed with the explicit matching
method become unnecessary.

Figure 1: Broad phonetic recognition model

The lexical component assigns a score reflecting how
well the phonetic representation of a word matches a
portion of the segmentation string, using knowledge about
the characteristics of the broad phonetic classifier’s output.
For example, the broad phonetic classifier may label /θ/
as “weak fricative” 60% of the time and “strong fricative”
40% of the time. Knowing this, the lexical component does
not penalize the score much when matching /θ/ to “strong
fricative.” In contrast, if /θ/ is never classified as “vowel”
during training, then the match of /θ/ to “vowel” would be
assigned a poor score. Insertions and deletions are handled
by using transition probabilities. If the broad phonetic
classifier consistently misses prevocalic nasals, as in the
word “nine”, then the system will know that very often
the /n/, as well as the /ay/, is labeled as “vowel”. This is
reflected by a high transition probability of matching /n/
to “vowel” and then matching /ay/ to “vowel”.

A forward dynamic programming algorithm finds the
best match between the broad phonetic and phonetic
strings. Simple slope constraints require the path to be
non-decreasing in each direction. In contrast to the constraints
used in dynamic time warping of the speech signal, many
phonetic labels may map into a single broad phonetic
segment. For example, the /I/, /r/ and /ow/ in “zero” may
map into the label “vowel” if the broad phonetic classifier
has no knowledge for differentiating among these sounds.

The allowed paths from a sample node are illustrated
in Figure 2. Each node represents the match between a
broad phonetic label and a phone. The sequence of broad
phonetic labels (reference) aligns with the nodes from left
to right, and the sequence of phones (test) aligns with the
nodes from top to bottom. Three paths, or transitions, exit
from a typical node, here labeled “A”. When no insertion
or deletion occurs, the next broad class segment is matched
to the next phone label; this is represented by Path AC.
If an extra label is created by the front-end, an insertion
occurred; this is represented by Path AB. And if the
broad phonetic classifier labels two sequential phones as
the same broad phonetic class, a deletion occurred; this is
represented by Path AD.

Figure 2: Paths used in the dynamic programming algorithm

The total accumulated score to node C, $d_C$, is:

$$d_C = d_A + \log\left(\Pr(p_C, l_C) \cdot W_C\right)$$

where $d_A$ is the total accumulated score to node A, and
$\Pr(p_C, l_C)$ is the probability of labeling the phone at node
C, $p_C$, as the broad class label $l_C$. $W_C$ is the probability of
making a transition from node A to C, given that node A
is the current state and nodes B, C, and D are states which
may be entered from node A. $W_C$ is computed as:

$$W_C = \frac{\Pr(p_A l_A \to p_C l_C)}{\Pr(p_A l_A \to p_B l_B) + \Pr(p_A l_A \to p_C l_C) + \Pr(p_A l_A \to p_D l_D)}$$

$W_B$ and $W_D$ are computed similarly and represent,
respectively, the probability of inserting and deleting a
segment.

The “best” alignment between the phonetic string
/zIrow/ and the broad phonetic representation
“strong-fricative vowel” is shown on the left of Figure 3; the
associated match and transition probabilities are shown on
the right. A phonetic string is assigned the score of the best
path, normalized by the number of transitions in the path.

Figure 3: Alignment of /zIrow/ with “strong-fricative vowel”
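
The scoring recurrence and the three exit paths can be illustrated with a short sketch. This is not the authors' implementation: the function name `align_score`, the probability floor, and the use of one fixed set of transition weights for every node are assumptions made for illustration (in the text, the transition probability depends on the phones and labels at the nodes involved).

```python
import math

FLOOR = 1e-6  # assumed floor for phone/label pairs never seen in training


def align_score(phones, labels, emit, trans):
    """Score the best alignment of a word's phones against broad-class labels.

    emit[phone][label] is Pr(phone is labeled as that broad class).
    trans holds hypothetical weights for the three exits of a node:
    'match' (Path AC), 'insert' (Path AB) and 'delete' (Path AD).
    """
    total = sum(trans.values())
    # (label step, phone step) for each exit path
    moves = {"match": (1, 1), "insert": (1, 0), "delete": (0, 1)}
    n_l, n_p = len(labels), len(phones)
    neg_inf = float("-inf")
    # best[i][j] = (accumulated log score, transitions taken) at label i, phone j
    best = [[(neg_inf, 0)] * n_p for _ in range(n_l)]
    best[0][0] = (math.log(emit[phones[0]].get(labels[0], FLOOR)), 0)
    for i in range(n_l):
        for j in range(n_p):
            score, steps = best[i][j]
            if score == neg_inf:
                continue
            for name, (di, dj) in moves.items():
                ni, nj = i + di, j + dj
                if ni >= n_l or nj >= n_p:
                    continue
                w = trans[name] / total  # normalized over the three exits
                e = emit[phones[nj]].get(labels[ni], FLOOR)
                cand = score + math.log(e * w)  # d_C = d_A + log(Pr * W)
                if cand > best[ni][nj][0]:
                    best[ni][nj] = (cand, steps + 1)
    score, steps = best[-1][-1]
    # normalize the best-path score by the number of transitions
    return score / max(steps, 1)
```

With a toy table in which /z/ is usually labeled “strong-fricative” and /I/, /r/ and /ow/ are usually labeled “vowel”, the phone string for “zero” scores far closer to 0 against “strong-fricative vowel” than a word containing a stop closure does.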

This method of lexical access was evaluated on a
database of digit strings ranging in length from one to
seven. The database was subdivided into training and new
speakers and into training and new sentences, resulting in
four mutually exclusive subsets as shown in Table 1.
Table 1: Corpus Subsets

Total # of Utterances   Speakers                     Total # of Digits
262 training            3 male, 3 female: training   1365
152 training            3 male, 2 female: new        599
370 new                 2 male, 3 female: training   1440
370 new                 3 male, 2 female: new        1440

Each broad class segment produced by the broad
phonetic classifier was used as the beginning segment of
each word hypothesis and the scores of all possible matches
were computed. The distributions of scores for correct
words and for all word hypotheses, evaluated on new
utterances by new speakers, are shown in Figure 4 as
dashed and solid lines, respectively. Note that the log
probability scores of the correct words are much closer to
0, or a probability of 1, than the bulk of the scores of all
possible words. The distributions indicate that a word score
threshold can be set such that all words with a score below
the threshold can be ruled out as viable candidates.


Figure  4:  Histograms  of  correct  and 
incorrect  word  scores 

Figure 5 illustrates the relationship between the amount
of pruning achieved and the percentage of correct
words pruned when evaluated on new utterances by new
speakers. Note that one can reduce the number of
hypothesized words by 50% without pruning any of the
correct words. The curves for training and new speakers
were found to be similar [5], indicating that the method is
potentially robust to speaker variabilities.
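
The trade-off between overall pruning and loss of correct words can be computed directly from the word scores. The sketch below is illustrative only; the function name `pruning_curve` and the toy scores in the usage note are assumptions, not data from the study.

```python
def pruning_curve(scores, is_correct, thresholds):
    """For each threshold, report the fraction of all word hypotheses pruned
    and the fraction of correct words pruned.

    scores are log-probability word scores (closer to 0 is better);
    a hypothesis is pruned when its score falls below the threshold.
    """
    n_all = len(scores)
    n_correct = sum(is_correct)
    rows = []
    for t in thresholds:
        pruned = [s < t for s in scores]
        frac_all = sum(pruned) / n_all
        frac_correct = sum(p and c for p, c in zip(pruned, is_correct)) / n_correct
        rows.append((t, frac_all, frac_correct))
    return rows
```

For example, with scores `[-1.0, -2.0, -8.0, -9.0]` of which only the first belongs to a correct word, a threshold of `-5.0` prunes half of the hypotheses and none of the correct ones.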

VERIFICATION 

In  the  broad  phonetic  recognition  model,  the  input  to 
the  verifier  is  a  lattice  of  word  candidates  produced  by 
the  lexical  component,  the  most  unlikely  candidates  having 
been  removed.  The  verifier  selects  the  best  word  or  string 
of  words  from  among  the  competing  word  candidates  using 
a  set  of  detailed  acoustic  features. 

Each word hypothesis is represented as a sequence
of phones and each phone is characterized by a set of
detailed acoustic features. This choice of representation
was motivated by linguistic reasons and by the desire for
extendability to other recognition tasks.

Figure 5: Pruning of all word hypotheses versus correct word hypotheses

Observations of phone characteristics in spectrograms
were used to select a small set of nine acoustic features.
These features were designed to capture salient acoustic
characteristics of speech sounds and detailed differences
between similar phones in the digits. The features are:

•  Position  of  the  first  three  formants  and  movement  of 
the  first  two  formants:  In  an  effort  to  achieve  a  robust 
characterization  of  formant  motion  and  position,  a 
gross  characterization  based  on  spectral  weights  was 
used,  rather  than  a  formant  tracker  which  can  exhibit 
inconsistent  behavior  in  nasalized  regions.  The  spectral 
weights  emphasize  energy  in  specific  regions  of  the 
spectrum. 

• Nasal possibility: To detect the presence of the low
frequency resonance characteristic of nasal murmurs,
this feature compares the energy in a passband of
100-350 Hz to the energy in a passband of 350-850 Hz.

• Onset rate: This feature is the maximum change
in energy from 1000-7000 Hz within 20 msec of the
beginning of a phone. To capture rapid transitions,
the energy is computed every msec from the short time
Fourier transform using a 2 msec Hamming window.

• Spectral offset location: This feature represents the
location of the first spectral dip higher in frequency
than the first major concentration of energy in a
smoothed spectrum.

• High frequency energy change: This feature is the slope
of the best linear fit to the energy in the 4500-7800 Hz
band over the duration of a phone. This feature is
intended to help differentiate between fricatives (which
have relatively stable energy) and unvoiced plosive
releases (which generally have a strong onset followed
by aspiration which weakens).
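
Two of the features above reduce to simple computations once a power spectrum or a band-energy track is available. The following is a minimal sketch under stated assumptions: the spectrum is supplied as parallel lists of frequencies and power values, the band comparison is expressed as a ratio in dB (the text says only that the two energies are compared), and the function names are hypothetical.

```python
import math


def band_energy(freqs, power, lo, hi):
    """Sum the power of all spectral bins with lo <= frequency < hi (Hz)."""
    return sum(p for f, p in zip(freqs, power) if lo <= f < hi)


def nasal_possibility(freqs, power):
    """Energy in 100-350 Hz relative to 350-850 Hz, in dB (assumed form)."""
    low = band_energy(freqs, power, 100.0, 350.0)
    mid = band_energy(freqs, power, 350.0, 850.0)
    return 10.0 * math.log10((low + 1e-12) / (mid + 1e-12))


def energy_slope(energies, frame_step_ms=1.0):
    """Slope of the least-squares line through a band-energy track.

    energies holds one value per frame over the duration of a phone,
    for example the energy in the 4500-7800 Hz band.
    """
    n = len(energies)
    if n < 2:
        return 0.0
    times = [k * frame_step_ms for k in range(n)]
    mean_t = sum(times) / n
    mean_e = sum(energies) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in zip(times, energies))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den
```

A positive `nasal_possibility` value indicates more energy in the low band, as expected for a nasal murmur; a strongly negative `energy_slope` is consistent with a plosive release whose aspiration weakens, while a slope near zero is consistent with a fricative.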

Hypothesis scoring can be viewed as a discrimination
or identification problem. A binary discrimination allows
small differences between similar candidates to be weighed.
In contrast, identification indicates how well the measured
feature values match the expected values for a phone,
independent of the values for the other phones. Because
lexical access based on a broad phonetic representation
results in similar sounding word candidates, the sounds to
be scored should be similar; hence discrimination seems
the better approach. Preliminary results bear out this
expectation, and a metric based on discrimination between
competing phones was used in scoring [5].

To identify errors due to the verification algorithm,
the inputs were idealized by mapping the phonetic
transcription of each utterance into a broad phonetic
transcription and then performing lexical access on these
“ideal” broad phonetic transcriptions. The verifier was
evaluated only on the subset of the lexical access database
which was phonetically transcribed.

Table  2  shows  the  word  error  rates  under  various  test 
conditions.  Each  insertion,  deletion,  or  substitution  was 
counted  as  an  error.  The  error  rates  illustrate  the  power 
of  using  a  few  carefully  selected  acoustic  features  combined 
with  statistical  measures  to  score  each  contending  phone. 
On  new  utterances,  the  error  rate  for  training  speakers 
is  only  slightly  better  than  for  new  speakers,  indicating 
that  an  acoustic-phonetic  approach  is  potentially  speaker- 
independent. 


Table 2: Word Error Rates

Utterances   Speakers   # of Speakers   # of Digits   Word Error Rate
training     training   6               1305          1.5%
training     new        3               301           [illegible].9%
new          training   [illegible]     1120          5.0%
new          new        [illegible]     893           5.3%

Detailed analysis of the errors in all corpora revealed
that many of the errors were due to differences in
male/female speech. The most striking and consistent error
was the confusion of “four” and “five”. All 16 cases in
which “five” was incorrectly recognized as “four” occurred
in speech by males. Eighteen of the 19 cases in which
“four” was incorrectly recognized as “five” occurred in
speech by females.

To obtain an indication of the robustness of the
verification scores, the score of the correct word relative
to the score of competing candidates was examined. When
the top candidate was correct, its score was compared to
the second best candidate’s score. When the top candidate
was incorrect, its score was compared to the correct word’s
score. Figure 6 shows the results of evaluation on new
utterances by new speakers. Note that the difference in
word scores is generally small when an incorrect word is the
best scoring word (dashed line), and that the difference has
a large range when the correct word is the best scoring word
(solid line). In a recognition system, this information could
be used to identify words which do not score much better
than their competitors. Finer analyses could be performed
on these words or the words could be rejected.


Figure  6:  Difference  in  word  scores  for 
correct  and  incorrect  classification 


The rank of the score of each phone in the correct word
is shown in Table 3. Note that for new utterances by both
the training speakers and the new speakers, the correct
phone is in the top position at least 86% of the time and
within the top two candidates at least 98% of the time. This
similarity in rank again indicates the potential
speaker-independence of using acoustic features in verification.


Table  3 
SUMMARY

Two components of a broad phonetic based continuous
digit recognition system have been examined. A method
for lexical access was implemented and shown to allow
a recognition system to tolerate reasonable front-end
variations in labeling. The use of a small set of fine phonetic
features for word verification was investigated and found
to be adequate. Additionally, evaluation showed these
components to be potentially robust to speaker variations.
These results are encouraging and indicate that a broad
phonetic approach is viable, but evaluation should now be
performed on a larger database.
ABSTRACT

It has been shown that broad phonetic sequences partition a
large lexicon into small equivalence classes of words sharing the
same sequence. While these results illustrate the power of broad
phonetic constraints for differentiating words from one another,
they do not suggest how to exploit sequential constraints in
recognition. This paper presents a method for decoupling sequential
phonetic constraints from a lexicon, by representing allowable
broad phonetic sequences in terms of n-th order Markov models.
A simple frame-based broad phonetic classifier is used to evaluate
the effectiveness of these models in recognition. Tests on 300
sentences from 30 male speakers demonstrate that the addition
of sequential constraints improves the classifier's performance.


INTRODUCTION 

We have been investigating the use of broad phonetic
sequences for hypothesizing words in speech recognition [1].
Shipman and Zue [2] demonstrated that broad phonetic sequences are
powerful for discriminating among the words in a large lexicon.
They showed that a large lexicon can be partitioned into small
equivalence classes by representing the words in the lexicon in
terms of sequences of six manner of articulation labels. For the
20,000-word Webster's Pocket Dictionary, there are an average of
approximately 35 words matching each broad phonetic sequence.
The largest equivalence class has about 200 words, or about 1% of
the lexicon.

A partitioned lexicon forms a table of words corresponding
to each broad class sequence. In the case of ideal data, a sequence
recognized in the speech signal can be used to look up the
possible matching words in the table. This presumes that the word
boundary is known, and hence applies most directly to isolated
word recognition. The variability in real speech data complicates
this simple access model.
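
The table lookup described above can be sketched as a dictionary keyed by broad class sequence. The phoneme symbols, the `BROAD` mapping and the four-word lexicon are hypothetical examples chosen for illustration; only the six manner-of-articulation classes come from the text.

```python
from collections import defaultdict

# Hypothetical phoneme-to-class mapping covering only the toy lexicon below.
BROAD = {
    "p": "STOP", "t": "STOP", "k": "STOP", "b": "STOP", "d": "STOP", "g": "STOP",
    "s": "STRONG-FRIC", "z": "STRONG-FRIC",
    "f": "WEAK-FRIC", "v": "WEAK-FRIC",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID-GLIDE", "r": "LIQUID-GLIDE", "w": "LIQUID-GLIDE",
    "ae": "VOWEL", "ih": "VOWEL",
}


def broad_sequence(phonemes):
    """Map a phoneme string to its broad phonetic class sequence."""
    return tuple(BROAD[p] for p in phonemes)


def partition(lexicon):
    """Group words into equivalence classes sharing a broad class sequence."""
    table = defaultdict(list)
    for word, phonemes in lexicon.items():
        table[broad_sequence(phonemes)].append(word)
    return {seq: sorted(words) for seq, words in table.items()}


lexicon = {
    "pat": ["p", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "sit": ["s", "ih", "t"],
    "man": ["m", "ae", "n"],
}
classes = partition(lexicon)
```

In this toy lexicon “pat” and “bad” fall into the same equivalence class, so a recognized sequence of stop, vowel, stop retrieves both as candidates.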

Our previous research has focused on developing a lexical
representation which is relatively insensitive to variability. This
work is summarized in the next section. The current paper
presents a method for using sequential phonetic constraints to
reduce the variability in a broad phonetic classifier. To evaluate
this method we implemented a simple frame-based broad
phonetic classifier and tested it both with and without the sequential
phonetic constraints.


LEXICAL  ACCESS 

The high degree of variability in speech means that a given
recognized sound sequence, S, can correspond to many possible
sequences, Pos(S), in the lexicon. The size of Pos(S) depends
on both the degree of variability in the speech, and the errors
introduced by the acoustic classifier.

To find all the possible words given sequence S, either the
lexicon must be probed once for each sequence in Pos(S) or each
word must be stored according to all its possible realizations.
Therefore, in order to reduce the number of word candidates
corresponding to S, it is necessary to minimize the size of Pos(S).
This can be done in two ways: (1) reduce the sensitivity of the
lexical representation to variability, and (2) reduce the variability
in the output of the broad phonetic classifier.

The key observation in making the lexical representation less
variable is that the variability in speech is not uniform. For
instance, the stressed syllables of words are less variable than the
unstressed syllables. This is illustrated by the fact that deletion
of phonetic segments occurs almost exclusively in unstressed
syllables. Thus for two identical broad class sequences S_s and S_u,
recognized from stressed and unstressed syllables respectively,
the first will have fewer possible underlying sequences than the
second: |Pos(S_s)| < |Pos(S_u)|.

In order to evaluate a representation based on stressed
syllables, we compared the relative importance of stressed and
unstressed syllables in partitioning a large lexicon [3]. This
investigation revealed that the phonemes in stressed syllables alone
provide almost as much constraint as the entire word: the size
of the lexical equivalence classes is almost the same for
representations using only the stressed syllables as for those using the
whole word. For representations using only unstressed syllables,
on the other hand, the size of the equivalence classes is two orders
of magnitude larger. These results strongly suggest that the
lexical representation should be based on the phonemes in stressed
syllables.

The second way of minimizing the size of Pos(S) is to reduce
the variability in the output of the classifier. The remainder of
this paper investigates how to use sequential phonetic constraints
to reduce the variability in the output of a broad phonetic
classifier. Since sequential phonetic constraints are implicit in the
words of a given lexicon, they must be decoupled from the
lexicon before they can be used in a classifier.


ICASSP 86, TOKYO

CH2243-4/86/0000-2259 $1.00 © 1986 IEEE

DECOUPLING THE CONSTRAINTS

This section investigates representing the sequential
phonetic constraints of English explicitly in terms of allowable
n-tuples of broad phonetic segments. To the extent that this
representation is independent of any particular lexicon, it can be said
to capture general sequential properties of English.

Sequential phonetic constraints are relatively local. For
example, English has the word initial sequences /spl/ and /spr/,
but not /spt/. At a broad phonetic level (using the six manner
of articulation classes vowel, nasal, liquid or glide, stop, strong
fricative, and weak fricative) this rule can be characterized as

[STRONG-FRIC][STOP][LIQUID]

is allowable but

[STRONG-FRIC][STOP][STOP]

is not. The locality of such rules implies that a first or second
order characterization of legal sound sequences should be sufficient
for capturing sequential phonetic constraints.

N-th Order Models

Given the locality of sequential phonetic constraints, we
can use the n-th order sequences (for n ≥ 2) in a large corpus
to construct a model of legal broad phonetic sequences. The
states of the model are n-tuples of broad phonetic segments,
and the transitions are single segments. A transition from state
(s_1, s_2, ..., s_n) to state (s_2, ..., s_n, s_{n+1}) occurs on input
s_{n+1}, where the s_i are broad phonetic segments.

For a broad phonetic scheme such as the one we have been
using, constructing these models is relatively easy because of the
small number of symbols. A third order characterization of a six
symbol system, such as the manner of articulation classification
used by Shipman and Zue, has only 216 possible states. For a
more detailed representational scheme, with forty or fifty
symbols, the number of possible states rapidly becomes intractable.

A given model can be formed by observing the broad
phonetic class sequences in a particular lexicon. For example, the
one-word lexicon “cast”, with the phoneme string /kæst/ and
the broad phonetic sequence

[STOP][VOCALIC][STRONG-FRIC][STOP]

generates a second order model with three states and two
transitions. However, this model does not capture the legal sequences
at the beginnings and ends of words. Therefore we make use of
two additional classes, [BEG] and [END], which mark before and
after a word. Using these two additional classes, the model shown
in Figure 1 is obtained for the one-word lexicon, “cast”.

Figure 1: Second order model of a one-word lexicon
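
The construction of an n-th order model from a lexicon can be sketched in a few lines. The function names `build_model` and `recognizes`, and the representation of a transition as a pair of adjacent states, are assumptions made for illustration rather than details given in the paper.

```python
def ngram_states(seq, n, pad):
    """Return the n-tuples (states) visited by a broad class sequence."""
    symbols = (["BEG"] + list(seq) + ["END"]) if pad else list(seq)
    return [tuple(symbols[k:k + n]) for k in range(len(symbols) - n + 1)]


def build_model(sequences, n=2, pad=True):
    """Collect the states and transitions observed in a lexicon.

    A transition is stored as a (from_state, to_state) pair; the input
    symbol that triggers it is the last element of to_state.
    """
    states, transitions = set(), set()
    for seq in sequences:
        grams = ngram_states(seq, n, pad)
        states.update(grams)
        transitions.update(zip(grams, grams[1:]))
    return states, transitions


def recognizes(model, seq, n=2, pad=True):
    """True if every state and transition of seq is present in the model."""
    states, transitions = model
    grams = ngram_states(seq, n, pad)
    return (all(g in states for g in grams)
            and all(t in transitions for t in zip(grams, grams[1:])))


cast = ["STOP", "VOCALIC", "STRONG-FRIC", "STOP"]
plain = build_model([cast], n=2, pad=False)   # 3 states, 2 transitions
padded = build_model([cast], n=2, pad=True)   # adds [BEG] and [END]
```

Without the boundary classes the one-word lexicon gives three states and two transitions, matching the count stated in the text; with [BEG] and [END] added, the same word gives five states and four transitions.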

To determine how well these models capture broad phonetic
constraints independent of a given lexicon, we compared models
of different lexicons. If a model of one lexicon recognizes the
sequences in other lexicons, then it has captured general properties
of English sound sequences rather than specific properties of the
lexicon. Second and third order models of the Pocket Dictionary
of 20,000 words, and Lorge and Thorndike's [illegible] most frequent
English words were compared. The models use the same six
manner of articulation classes as the [illegible] studies. The second
order model of the [illegible]-word lexicon correctly recognizes [illegible]
of the words in the [illegible]-word lexicon. The third order model
correctly recognizes [illegible] of the words. This strongly supports
the fact that the models are independent of a given lexicon.

In addition to specifying all allowed n-th order broad phonetic
sequences, the networks can be used to encode the likelihood
of occurrence for each sequence. This is done by augmenting
the arcs of the network with transition probabilities. Using a
lexicon with word frequencies, the likelihood of a given transition
is proportional to the frequency of the words in which it occurs.
Our lexicons all have word frequency information from the Brown
Corpus of written English [4]. With the addition of transition
probabilities, the networks form n-th order Markov models of
the sequential phonetic constraints.
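
Frequency-weighted transition probabilities can be estimated by accumulating word frequencies on each arc and normalizing per source state. This is a sketch under assumptions: the function name, the normalization per source state, and the toy frequencies are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def transition_probabilities(lexicon, n=2):
    """Estimate arc probabilities for an n-th order model.

    lexicon is a list of (broad_class_sequence, word_frequency) pairs.
    Each transition is weighted by the frequency of the words in which it
    occurs, then normalized over the arcs leaving the same state.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq, freq in lexicon:
        symbols = ["BEG"] + list(seq) + ["END"]
        grams = [tuple(symbols[k:k + n]) for k in range(len(symbols) - n + 1)]
        for src, dst in zip(grams, grams[1:]):
            counts[src][dst] += freq
    return {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }


toy = [
    (["STOP", "VOWEL", "STOP"], 30.0),   # hypothetical frequent word
    (["STOP", "VOWEL", "NASAL"], 10.0),  # hypothetical rarer word
]
probs = transition_probabilities(toy)
```

Here the arc from state (STOP, VOWEL) to (VOWEL, STOP) receives probability 0.75 and the arc to (VOWEL, NASAL) receives 0.25, reflecting the relative word frequencies.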

APPLYING  THE  CONSTRAINTS 

The goal of incorporating sequential phonetic constraints
into a classifier is to reduce the variability in the classifier's
output. To evaluate the effectiveness of the models developed in the
previous section we implemented a simple broad phonetic
classifier. Given the output of the classifier, a Markov model is used
to determine the best broad phonetic sequence consistent with
that model. Figure 2 diagrams the relation between the
classifier and the model. In effect the sequential phonetic constraints
of the model are used to “correct” the output of the classifier.
Comparing the best sequences for different models serves as a
paradigm for evaluating the power of the models.

Figure 2: Paradigm for comparing different Markov models
of sequential phonetic constraints
[Diagram labels: Speech Signal, Classifier, Sequence, Best Sequence]

The n-th order Markov model of a given lexicon captures
sequential phonetic constraints in terms of broad phonetic
segments. Since the classifier performs frame-by-frame classification
of the speech signal, the segment-based networks of the previous
section must be converted to frame-based networks.

Each arc in a segment-based network corresponds to a given
broad phonetic segment, as illustrated in Figure 1. To convert
this network into a frame-based network, each arc is replaced by
a recognizer for its corresponding segment. This segment
recognizer models a segment as one or more successive frames of
the same type, as shown in Figure 3. The self-transition
probability, p, is obtained by observing the durations of each broad
phonetic segment in a training set. The small number of broad
phonetic segments makes robust training possible with relatively
little data.

Figure 3: Frame-by-frame model of a broad phonetic segment
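
One common way to turn observed durations into a self-transition probability is to match the mean of the geometric duration distribution that a single self-looping state implies. The paper does not state which estimator was used, so the formula below is an assumption, as are the function names.

```python
def self_transition_probability(durations_in_frames):
    """Estimate the self-loop probability p of a one-state segment model.

    A state with self-loop probability p stays for 1 / (1 - p) frames on
    average, so p is chosen to match the mean observed duration.
    Durations are assumed to be at least one frame each.
    """
    mean = sum(durations_in_frames) / len(durations_in_frames)
    return 1.0 - 1.0 / mean


def expected_duration(p):
    """Mean number of frames spent in a state with self-loop probability p."""
    return 1.0 / (1.0 - p)
```

For example, segments lasting 4 and 6 frames have a mean duration of 5 frames, which gives a self-transition probability of 0.8.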

The same segment recognizer is used regardless of what
states a transition occurs between. One obvious extension is to
determine the degree to which broad phonetic context influences
the acoustic realization of broad phonetic segments. If context
predicts the acoustic realization, then different segment
recognizers can be used in different contexts.

The network formed by replacing each segment arc with its
corresponding frame-based recognizer forms a Markov model of
frame sequences. This model captures both the n-th order
segment information in the lexicon and the duration information in
the training set.

Finding  the  Best  Frame  Sequence 

Given a sequence of frames generated by the classifier, we
wish to determine the best frame sequence given that input and
an n-th order frame-based Markov model. This can be done using
the forward-backward algorithm [5], and defining the best frame
sequence to be the highest probability sequence given the model
and the input sequence.

The highest probability sequence is simply the highest
probability state at each time. The highest probability state at a
given time is found by using the forward-backward algorithm to
compute $\Pr(s_t = q_i \mid C)$, the probability of being in state $q_i$ at
time $t$ given the observed sequence $C$. This computation is done
using the conditional probability

$$\Pr(s_t = q_i \mid C) = \frac{\Pr(C \text{ and } s_t = q_i)}{\Pr(C)} \qquad (1)$$

and the relation

$$\Pr(C \text{ and } s_t = q_i) = \alpha_t(i)\,\beta_t(i)$$

where

$$\alpha_t(i) = \Pr(C_1, \ldots, C_t \text{ and } s_t = q_i)$$

$$\beta_t(i) = \Pr(C_{t+1}, \ldots, C_T \mid s_t = q_i)$$

are the forward and backward probabilities, respectively. These
probabilities and the denominator of (1) can be computed
efficiently, in $O(n^2 T)$ time for an n state model, using recursion
formulas.

The next section describes a classifier which vector quantizes
the output of three bandpass filters into VQ codewords.
These codeword sequences are input to the forward-backward
algorithm, along with the likelihood of each broad phonetic class
given a particular codeword.
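
The state-by-state decoding described above can be sketched with a scaled forward-backward pass. This is a minimal illustration, not the authors' code: the function name, the list-based model layout, and the use of one state per broad class (rather than the full n-th order frame network) are simplifying assumptions.

```python
def posterior_decode(obs, states, init, trans, emit):
    """Pick the highest-posterior state at each frame.

    obs   : list of codeword indices, one per frame
    init  : init[i] is the probability of starting in state i
    trans : trans[i][j] is the probability of moving from state i to j
    emit  : emit[i][c] is the score of codeword c in state i
    """
    n, T = len(states), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    # Forward pass; rescaling each frame avoids underflow and does not
    # change which state has the largest posterior.
    for t in range(T):
        for j in range(n):
            if t == 0:
                prior = init[j]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
            alpha[t][j] = prior * emit[j][obs[t]]
        scale = sum(alpha[t]) or 1.0
        alpha[t] = [a / scale for a in alpha[t]]
    # Backward pass with the same kind of rescaling.
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n)
            )
        scale = sum(beta[t]) or 1.0
        beta[t] = [b / scale for b in beta[t]]
    path = []
    for t in range(T):
        gamma = [alpha[t][i] * beta[t][i] for i in range(n)]
        path.append(states[max(range(n), key=gamma.__getitem__)])
    return path
```

With two states whose self-transition probability is 0.9 and a codeword that favors each state 0.8 to 0.2, a single conflicting frame in the middle of five frames is overruled by its neighbors, which is the smoothing effect the sequential constraints are meant to provide.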


THE  CLASSIFIER 

The input to the classifier is a crude spectral shape in the
form of three bandpass filtered energies in the ranges [illegible] Hz,
[illegible] Hz and [illegible] Hz, computed every 10 milliseconds.
Each energy value is computed using a 20 millisecond Hamming
window. The three energy values at each frame are then vector
quantized into one of eight VQ codewords. This VQ codeword
sequence is used as input to the forward-backward algorithm.

The  training  procedure  involve-  three  stages  ’I  he  same  set 
of  training  utterances  is  used  for  all  three  stages  The  fir-t  stag* 
est  i  mates  the  self- 1  r  ansi  t  mn  probabilities.  />, ,  for  the  segment  rec¬ 
ognizers  used  to  construct  (lie  frame-based  network  These  prob¬ 
abilities  are  determined  using  the  durations  of  the  hand-labi  led 
dat a  segments 

The second stage forms the vector quantization codebook.
The three bandpass energy values for each frame of the training
data are input to a k-means clustering procedure using a Euclidean
distance metric. The VQ codewords are the centroids of
the resulting clusters.
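
The codebook stage can be sketched as plain k-means over the three-element energy vectors. This is a minimal illustration under stated assumptions: the initialization (frames spread evenly through the data), the fixed iteration count, and the function names are all choices made for the example, since the source gives none of these details.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)))


def kmeans(frames, k, iterations=20):
    """Plain k-means; frames is a list of equal-length energy vectors."""
    # Deterministic initialization (an assumption): k frames spread evenly
    # through the training data.
    codebook = [list(frames[i * len(frames) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in frames:
            clusters[nearest(codebook, x)].append(x)
        for j, members in enumerate(clusters):
            if members:  # leave an empty cluster's codeword unchanged
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook
```

Once the codebook is built, `nearest(codebook, frame)` serves as the vector quantizer: it maps each frame of three energies to a codeword index.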

The third stage estimates the likelihood of the broad phonetic
classes given each VQ codeword. These probabilities are
estimated from hand labeled data and the output of the vector
quantizer. Given the small number of broad classes, these
probabilities can be estimated from relatively little data.
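
A simple way to obtain such estimates is to count how often each hand label co-occurs with each codeword. The sketch below assumes relative-frequency estimates with no smoothing; the source does not state the estimator, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def class_given_codeword(labels, codewords):
    """Relative-frequency estimate of Pr(class | codeword) from aligned frames."""
    counts = defaultdict(Counter)
    for label, cw in zip(labels, codewords):
        counts[cw][label] += 1
    return {cw: {label: n / sum(c.values()) for label, n in c.items()}
            for cw, c in counts.items()}
```

With eight codewords and four or five classes there are at most forty probabilities to estimate, which is consistent with the remark that little training data is needed.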

RESULTS

The classifier was evaluated using both four and five broad
phonetic classes. The four classes are vocalic (VOC), voiced
closure (VCL), noise (NZ), and silence (SIL). The five classes
differ only in the replacement of the single class NZ by the two
classes fricative (FRC) and burst ([illegible]).

The [illegible]-word lexicon from the Harvard List sentences was
used to form the Markov models of broad phonetic sequences.
This lexicon is similar in complexity to the Lorge-Thorndike and
Pocket dictionaries described above. A second order model of
the Harvard lexicon using the five broad classes recognizes all the
words in the [illegible] and [illegible] word dictionaries. A third order
model recognizes [illegible] and [illegible] of the two lexicons, respectively.

Performance was measured using both frame and segment-based
statistics. The frame-by-frame performance compares the
best frame sequence output by the forward-backward algorithm
against the hand label for each frame. This yields both a confusion
matrix and an overall percent-correct figure for the frame-by-frame
classification.
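
The frame-by-frame scoring can be illustrated with a few lines of code. This is a sketch, assuming the hand labels and automatic labels are already aligned frame for frame; the function name is hypothetical.

```python
from collections import Counter


def frame_scores(hand, auto):
    """Return (confusion counts, percent correct) for aligned frame labels."""
    # confusion[(h, a)] counts frames hand-labeled h that were classified as a.
    confusion = Counter(zip(hand, auto))
    correct = sum(n for (h, a), n in confusion.items() if h == a)
    return confusion, 100.0 * correct / len(hand)
```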

The segment-based performance is computed by chaining
together successive frames with the same label. The resulting
segments are then compared against the hand labeled segments
by computing a best match between the two segment sequences.
The match is constrained such that segments must co-occur in
time in order to be matched. This provides a more accurate
(and more conservative) performance measure than a best string
match.
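
The chaining step and the time-overlap constraint can be sketched as below. The best-match alignment between the two segment sequences is not shown, because the source does not describe it in enough detail; the function names and the (label, start, end) representation are assumptions.

```python
from itertools import groupby


def chain_frames(labels):
    """Merge runs of identical frame labels into (label, start, end) segments.

    Frame indices are used as times; end is exclusive.
    """
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def overlaps(a, b):
    """True if two segments carry the same label and share at least one frame."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
```

A matcher built on these helpers would only pair an automatic segment with a hand-labeled segment when `overlaps` is true, which is what makes the measure stricter than a pure string match on the label sequences.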

The system was tested separately for each of thirty male
speakers. Each speaker said ten sentences from the Harvard List.
For each test, the training set consisted of the remaining twenty-nine
speakers. This procedure was used to establish the speaker
independence of the system. Table 1 shows the frame-by-frame
performance averaged across all thirty speakers for both four and
five broad phonetic classes. Three different sequential phonetic
models were evaluated. The first row shows the results for the
“0-th order” models, which incorporate durational information
but no sequential phonetic constraints. The second row shows
the results for the first order models. Since the results for the
second order models are almost the same, they are not presented.

Order    4 Classes      5 Classes

Zero     67.3%          67.1%

First    [illegible]    71.8%

Table 1: Average correct frame-by-frame classification for
thirty speakers. System was trained and tested separately
for each speaker; the test speaker was not used for training.

These results demonstrate that adding sequential phonetic
constraints increases the frame-by-frame recognition performance
over the zero-order model. Using more sophisticated segment
recognizers to convert the segment-based to frame-based models
could further increase the frame-by-frame performance.

Table 2 shows the segment-based performance. The percentage
of segments correctly classified is reported, along with
the segment insertion rate in parentheses. The sequential phonetic
constraints have a substantial effect in reducing the segment
insertion rate, without greatly decreasing the percentage of the
segments correctly recognized. Recall that these results are relatively
conservative because automatic and hand labeled segments
must overlap in time in order to be considered a correct match.


Order    4 Classes        5 Classes

Zero     81.5% (63.7%)    83.0% (78.4%)

First    74.1% (11.0%)    75.1% (11.8%)

Table 2: Average segment-based correct classification for
thirty speakers. The segment insertion rate is in parentheses.


The segment insertions for the first order models are highly
regular. For instance, 66% of the insertions in the 5-class condition
are VCL in a FRC VOC context. Thus additional processing
of the segments should be able to substantially reduce the
11% insertion rate. Without the sequential constraints the errors
show no such regular patterns, and hence further processing
is not likely to reduce the error rate.

Syllable  Stress  Affects  Classifier  Performance 

The classifier's segment deletion rate is higher in the unstressed
syllables than in the stressed syllables. In making this
comparison, only monosyllabic words were considered (84% of
the words in the utterances are monosyllabic). A word was
called unstressed if the nuclear vowel was reduced (transcribed
as a schwa) and otherwise was called stressed. For the unstressed
words the deletion rate was 23.7%, whereas for the stressed words
it was only 15.9%. This result adds further support to the earlier
observation that the stressed syllables are important in hypothesizing
words.

SUMMARY 

While lexicon studies demonstrate the power of broad phonetic
constraints for differentiating words from one another, they
do not suggest how such constraints can be directly exploited
in recognition. This paper has presented a method for decoupling
sequential phonetic constraints from a given lexicon, by
representing allowable broad phonetic sequences in terms of n-th
order Markov models. Tests of a simple frame-based broad phonetic
classifier on 300 sentences from 30 speakers demonstrate
that these models can be used to increase the performance of a
broad phonetic recognizer.
ABSTRACT

This paper describes a system that applies vision techniques to
extract acoustic patterns in the speech spectrogram. By processing
a spectrographic image through a set of edge detectors and combining
their outputs, the system obtains two-dimensional objects that
characterize the formant patterns and general spectral properties for
vowels and consonants. As a validation of the approach, a limited
vowel recognition experiment was performed on the “object”
spectrograms. Preliminary results show that this processing technique
retains relevant acoustic information necessary to identify the
underlying phonetic representation.


INTRODUCTION 

For the past four decades, the prevailing form for displaying
speech has been the spectrogram, a three-dimensional
time-frequency-intensity representation of the signal. The spectrogram
provides a visual display of the relevant temporal and
spectral characteristics of the acoustic signal. It has been an
invaluable tool in the development of our understanding of the
acoustic properties of speech sounds.

Recently, the spectrographic display took on added significance
as it was demonstrated that the underlying phonetic
representation of an unknown utterance can be extracted almost
entirely from a visual examination of the speech spectrogram
[2], [3], [9]. In these experiments, a trained spectrogram reader
correctly identified the phonetic segments with 80% to 90%
accuracy, depending on the experimental conditions and the scoring
procedures. The reader's performance, measured in terms of
accuracy and rank-order statistics, was considerably better than
that of the phonetic front-ends of available speech recognition
systems. These experiments stirred renewed interest in acoustic-phonetic
approaches to speech recognition, and supported the
speculation that better front-ends may be constructed if we can
learn the phonetic decoding procedure used by human experts.

Protocol analysis of spectrogram reading reveals that the
decoding process calls for the recognition and integration of a
myriad of acoustic patterns. In order to develop a system that
utilizes such knowledge, one must first be able to extract these
acoustic patterns.

This paper is concerned with the visual characterization of
speech spectrograms. Our aim is to capture the essential acoustic
patterns of a spectrogram so that these abstracted patterns
may be used to characterize and recognize different speech
sounds. Traditional descriptions of acoustic-phonetic events
based on formant frequencies are often inadequate because the
formants cannot always be resolved reliably. Thus visual
characterizations may provide an alternative, and perhaps more
effective, description.

*This research was supported by DARPA under contract N00014-82-K-0727,
monitored through the Office of Naval Research.

Processing the spectrogram as a three-dimensional image
has a number of important advantages. First, one can better
capture the time-frequency dependency of the speech signal by
treating the time and frequency dimensions simultaneously.
Second, we can liberally borrow from techniques developed through
many years of successful vision research. Third, characterizing
a spectrogram is a highly constrained vision task. The three
dimensions of the spectrogram correspond to physically
meaningful quantities, namely, time, frequency, and amplitude. The
patterns on the spectrogram are also limited by the nature of
the speech production mechanism and the restricted sound
patterns of a language.

SYSTEM  DESCRIPTION 

Our approach to visual characterization of speech spectrograms
is to treat the acoustic patterns as visual objects. These
objects are obtained by applying edge detection to the spectrographic
image, producing an “edge map” as output. The edge
map includes explicit information about the position, the
orientation, and the relative strength of edges. These edge elements
are grouped into closed geometrical contours. The remainder of
this section describes the system in greater detail, focusing on
the vowel-like sounds. Obstruent sounds have visual patterns
that are quite different from those of vowel-like sounds. Their
treatment will be described near the end of this section.

Edge  Detection 

The system obtains a narrow-band spectrographic representation
by computing a short-time spectrum once every 5 ms with
a 25.6 ms window. The vowel-like regions of the image, determined
through a broad phonetic classifier [5], are then processed
through two-dimensional directional edge detectors of different
scales. The cross-section in the frequency dimension is the second
derivative of a Gaussian, and the cross-section in the time
dimension is a Gaussian. The directional Gaussian edge detector
has been shown by Canny [1] to have many useful properties
such as robustness against detection errors, good localization to
true edges, and dimensional separability. Thus this operator
smooths the spectrogram in the time dimension and also detects
edges that are approximately orthogonal to the frequency
dimension. Zero-crossings of the filtered output correspond to
edges in the original spectrogram. Another advantage of using
Gaussian detectors is that the zero-crossings do not disappear
as the scale¹ decreases [7], [8]. This is an important property for
combining outputs from different scales.
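
The frequency cross-section of the operator can be illustrated in one dimension. The sketch below filters a single spectral slice with a sampled second derivative of a Gaussian and reports the zero-crossings; the time-dimension Gaussian smoothing is omitted. The kernel radius of four standard deviations, the border handling, and all function names are assumptions made for the example.

```python
import math


def d2_gaussian_kernel(sigma):
    """Sampled (unnormalized) second derivative of a Gaussian of scale sigma."""
    radius = int(math.ceil(4 * sigma))
    return [((x * x) / sigma ** 4 - 1 / sigma ** 2)
            * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]


def filter_same(signal, kernel):
    """Apply a symmetric kernel, replicating the border values of the signal."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += k * signal[idx]
        out.append(acc)
    return out


def zero_crossings(values, eps=1e-9):
    """Indices i where the filtered output changes sign between i and i + 1."""
    return [i for i in range(len(values) - 1)
            if (values[i] > eps and values[i + 1] < -eps)
            or (values[i] < -eps and values[i + 1] > eps)]
```

Applied to an ideal step in energy, the filtered output is positive on one side of the step and negative on the other, so the single zero-crossing marks the edge location. A larger sigma gives a smoother, more robust response at the cost of resolution.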

One  potential  problem  of  using  a  directional  operator  is  that 
its  performance  might  degrade  if  the  formants  are  not  quite 
horizontal.  Multiple  directional  operators  oriented  at  different 
angles  might,  therefore,  be  needed.  However,  due  to  the  slug¬ 
gishness  of  articulatory  movements,  formant  frequencies  cannot 
change  very  quickly.  Preliminary  results  show  that  if  the  Gaus¬ 
sian  cross-section  in  the  time  dimension  is  made  small  enough 
(on  the  order  of  1  pixel),  the  edge  detector  can  pick  up  fast 
formant  movement. 

Combining Multiple Scales

Parts (a) and (e) of Figure 1 show the narrow-band and wide-band
spectrograms, respectively, for the nonsense word, “boyt”,
spoken by a female speaker. Parts (b), (c), and (d) show the
results of filtering the narrow-band spectrogram with the directional
edge detectors of different scales. The plots correspond
to a σ of 4, 3, and 2 pixels, with sigma decreasing from left to
right in the figure. The output with the largest scale is the most
robust but has the least resolution, whereas the one with the
smallest scale has the best resolution but also has many extraneous
edges. In order to achieve robustness and good resolution
simultaneously, these outputs must be systematically combined.

We have chosen to combine the outputs by performing a
coarse-to-fine tracking in a way similar to scale-space filtering
proposed by Witkin [7]. This approach has the advantage of
managing the ambiguity of scale in an organized and natural
way. Since zero-crossings do not disappear as the scale decreases,
the coarse-to-fine tracking works properly.
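
One way to picture coarse-to-fine tracking is to start from the zero-crossings found at the coarsest scale and follow each one to the nearest zero-crossing at every finer scale. The sketch below works on one-dimensional positions only; the nearest-neighbour matching rule, the `max_shift` tolerance, and the function name are assumptions, since the source does not give these details.

```python
def coarse_to_fine(crossings_by_scale, max_shift=2):
    """Track coarse zero-crossings down to the finest scale.

    crossings_by_scale: lists of positions, ordered from coarsest to finest.
    """
    tracked = list(crossings_by_scale[0])
    for finer in crossings_by_scale[1:]:
        refined = []
        for pos in tracked:
            candidates = [p for p in finer if abs(p - pos) <= max_shift]
            # Keep the old position if the finer scale offers no nearby crossing.
            refined.append(min(candidates, key=lambda p: abs(p - pos))
                           if candidates else pos)
        tracked = refined
    return tracked
```

Only edges that exist at the coarse scale survive, but their final positions come from the fine scale, which is how the combination gains resolution without admitting the extraneous small-scale edges.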

Figure  1(f)  illustrates  the  result  after  combining  edges  from 
the  different  scales.  It  can  be  seen  that  the  result  has  good 
resolution  and  is  robust. 

Applying  Speech  Knowledge 

While coarse-to-fine tracking solves the problem of localizing
large-scale events, it does not solve the multi-scale integration
problem. Which of the edges found by the small-scale operators
are robust, and which edges are due to noise? There are a number
of ways to determine which edges are valid. One measure
is to examine the amount of intensity change. The amplitude
of the output of the first derivative Gaussian, and the slope of
the zero-crossings of the second derivative Gaussian, are good
indicators of the amount of intensity change. However, some
form of thresholding is needed, which may lead to gross error.

We have chosen, instead, to apply specific speech knowledge
to select the edges. We first apply a bandwidth constraint. For
some vowels, formants can be quite close to each other. Sometimes
they are so close together that it is impossible to separate
them by eye. Spectrogram readers are able to tell that there are
two formants because of the bandwidth. Thus after the coarse-to-fine
tracking is performed, regions with significantly large
bandwidths are suspected of having more than one formant. In
these cases, edges from the smaller operator outputs can be included
if the bandwidths after the insertion of the additional
edges are still reasonable. This heuristic is quite robust in the
vowel regions. To avoid including spurious edges, however, the
original bandwidth needs to be quite large so as to trigger insertion
of edges. This means that some of the good edges from
the smaller-scale detectors are inadvertently omitted. In order
to locate these edges, more elaborate procedures are needed.

¹ The scale is a measure of the width of an edge detector. For a Gaussian
detector, the scale corresponds to the standard deviation, σ.

For some vowels, the formants are quite close to each other
for some duration, but gradually separate and finally split apart.
After the formants split, edges can be detected quite reliably.
These edges can then be used as anchor points to find edges
when the two formants approach each other. As we have seen
in Figure 1(f), F1 and F2 begin to split apart at approximately
the midpoint of the vowel. This kind of split provides strong
evidence that more edges should lie to the left of this point.
These subtle edges are located by the following “digging”
procedure. Starting from this point, edges to the left are examined.
If these edges satisfy a continuity requirement, they are
considered “good” edges. Building upon the extensions, edges
further to the left are then examined. This process repeats until
no more edges are found or until the continuity constraint
is violated. Figure 1(g) shows the result after the “digging”
operation. In this example, the operation has dug through the
entire region and correctly located the first two formants of the
vowel. (Note also that objects with average frequency above
3.5 kHz have been discarded, since they do not contribute to
the phonetic identity of vowels.)
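
The “digging” loop can be sketched as a frame-by-frame extension from an anchor edge. The continuity requirement is modeled here as a maximum frequency jump between neighbouring frames; that criterion, the `max_jump` parameter, and the function name are assumptions, because the source does not define the requirement precisely.

```python
def dig_left(anchor_freq, candidates_by_frame, max_jump):
    """Extend an edge leftward from an anchor point.

    candidates_by_frame: candidate edge frequencies for each frame,
    ordered from the frame just left of the anchor outward.
    """
    path, current = [], anchor_freq
    for candidates in candidates_by_frame:
        near = [f for f in candidates if abs(f - current) <= max_jump]
        if not near:
            break  # no edge found, or the continuity constraint is violated
        current = min(near, key=lambda f: abs(f - current))
        path.append(current)
    return path
```

Each accepted edge becomes the reference for the next frame, which mirrors the description of building upon the extensions until the process can go no further.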

The scale-space filtering, augmented with the above two procedures,
is quite robust in finding formant edges in the vowel
regions. At relatively high frequencies, the detected edges usually
correspond to edges of the formant frequencies. However,
there is very often an energy concentration below 300 Hz due to
F0. When F1 is low, this small energy concentration is masked
by F1. But when F1 is higher in frequency, this energy concentration
becomes more and more noticeable. Trained spectrogram
readers are very good at ignoring it. We are not yet
sure how to deal with these shallow edges in the system. At this
moment, we have chosen to ignore edge contours with average
frequency less than 300 Hz if there is another edge contour with
average frequency below 800 Hz. This condition ensures that
the ignored contour does not correspond to F1.
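
The low-frequency rule stated above is simple enough to write down directly. The sketch assumes each contour is summarized by its average frequency in Hz; the function name is hypothetical, and the reading that two contours below 300 Hz would both be dropped is one interpretation of the rule.

```python
def drop_f0_contours(avg_freqs_hz):
    """Drop contours below 300 Hz when another contour lies below 800 Hz."""
    kept = []
    for i, f in enumerate(avg_freqs_hz):
        other_below_800 = any(g < 800
                              for j, g in enumerate(avg_freqs_hz) if j != i)
        if f < 300 and other_below_800:
            continue  # treated as the F0 energy concentration, not F1
        kept.append(f)
    return kept
```

When the only low contour is the one below 300 Hz, it is kept, since in that case it may well be F1 itself.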

Processing  of  Obstruent  Regions 

Obstruents are characterized by their general spectral distributions
rather than any specific formant patterns. As a result,
the processing for the obstruent regions is considerably different
from that of sonorant regions. The obstruent regions are again
determined by the broad phonetic classifier. A very coarse edge
detector is applied to the wide-band spectral slices, computed
with a 6.7 ms window. The objects are obtained from the edge
map with no further processing.

Figure 1(h) shows the final result for the word “boyt,” including
both the vowel-like and obstruent-like regions. Comparing
this figure with the original spectrogram, we see that relevant
features in the original spectrogram have been captured in
the objects. As a more elaborate example, Figure 2(b) shows
the objects obtained from a continuous sentence spoken by a
male speaker. For comparison, the corresponding wide-band
spectrogram is shown in Figure 2(a). If the extracted objects
indeed capture the important information in the spectrogram,
then they can be used as a mask to filter out irrelevant acoustic
information, as shown in Figure 2(c). We see that important
acoustic information in this utterance, such as the formant
transitions in vowel regions and the shift in spectral energy
distributions in obstruent regions, has been accurately retained after
processing.


RECOGNITION  EXPERIMENTS 

The examples shown in Figures 1 and 2, and informal “object-reading”
experiments performed by spectrogram-reading experts,
suggest that the procedure described in the previous section
is potentially useful in extracting important acoustic features
from the spectrograms. The extracted patterns can, for example,
provide the necessary information for the development of a
knowledge-based system for phonetic recognition [10]. Alternatively,
one can build up an inventory of these patterns in order
to characterize and recognize speech sounds directly, using a
variety of visual object recognition algorithms [6]. Before we start
to utilize these objects in either of the two tasks, however, we
must first make sure that these processed visual patterns indeed
retain the necessary information for the recognition of the
underlying phonetic segments.

As a step in this direction, we performed a small vowel recognition
experiment. The task involves the recognition of 14 vowels,
/i, ɪ, e, ɛ, æ, ɑ, ɔ, ʌ, o, u, [illegible], ay, ɔy, aw/, spoken in the
/b/-vowel-/t/ environment by 8 male speakers. Due to the limited
amount  of  available  data,  the  recognition  was  performed  using 
a  rotational  procedure;  in  each  trial  the  system  was  trained  on 
the  data  from  seven  speakers  and  tested  on  the  remaining  one. 
For  each  vowel,  the  recognizer  chose  from  the  seven  training 
samples  the  one  with  the  smallest  intra-sample  distance  as  the 
reference template. A dynamic time warping algorithm [4], with
appropriate local path constraints, was used to compensate for
differences in duration between the test and reference patterns.
No attempt was made to normalize the frequency scale to
account for inter-speaker differences.
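
Dynamic time warping can be sketched in a few lines. The version below uses the simplest symmetric local path constraint (a step from the left, from below, or along the diagonal) and takes the frame distance as a parameter; the constraints actually used in the experiment are not specified in the source, so these choices and the function name are assumptions.

```python
def dtw_distance(test, ref, dist):
    """Minimum accumulated frame distance over all monotonic alignments."""
    inf = float("inf")
    n, m = len(test), len(ref)
    # cost[i][j] is the best accumulated distance aligning test[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(test[i - 1], ref[j - 1]) + best
    return cost[n][m]
```

In the vowel experiment each frame would be a spectrum and `dist` a spectral distance; with scalar frames and an absolute difference, a pattern and a time-stretched copy of itself align at zero cost.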

The objects determined by our processing system do not retain
amplitude information, which is often useful in characterizing
speech sounds. Therefore, we created from the objects a
cartoonized spectrum for each time frame. Regions inside the
objects were replaced by a constant value that is equal to the
average value of the corresponding regions in the original spectrum,
whereas regions outside were set to zero. The cartoonized
spectrum was then smoothed with a Gaussian window. Parts
(a), (b), and (c) of Figure 3 illustrate, respectively, a vowel
spectrum (superimposed by an LPC spectrum), the cartoonized
spectrum derived from the edges, and the smoothed spectrum
used for recognition. A Euclidean distance was used to measure
similarities between spectra. For comparison, we also implemented
an LPC-based system using Itakura's distance
metric [4].
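
The construction of the cartoonized spectrum for one frame can be sketched as follows. The objects are represented here as ranges of frequency bins, which is an assumption about the data layout; the Gaussian smoothing step is left out, and the function name is hypothetical.

```python
def cartoonize(spectrum, objects):
    """Replace each object region by its mean level and zero everything else.

    objects: list of (lo, hi) frequency-bin ranges covered by objects,
    with hi exclusive.
    """
    out = [0.0] * len(spectrum)
    for lo, hi in objects:
        mean = sum(spectrum[lo:hi]) / (hi - lo)
        for k in range(lo, hi):
            out[k] = mean
    return out
```

For instance, a six-bin spectrum [1, 3, 5, 2, 8, 4] with objects over bins 1 to 2 and bin 4 becomes [0, 4, 4, 0, 8, 0]: amplitude survives only as one level per object.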


The  results  of  our  vowel  recognition  experiments,  based  on 
the  112  vowel  tokens  from  eight  speakers,  show  that  the  smoothed 
spectra  can  be  used  to  identify  the  vowels  with  an  83%  first- 
choice  accuracy.  The  correct  vowel  is  within  the  top  two  choices 
94%  of  the  time.  This  result  compares  favorably  to  that  us¬ 
ing  the  LPC/Itakura-Distance  method.  While  it  is  premature 
to  base  our  conclusion  on  such  a  restricted  corpus,  we  are  nev¬ 
ertheless  encouraged  by  the  results.  It  appears  that,  for  this 
data  set  at  least,  our  processing  system  did  not  remove  acoustic 
information  that  is  necessary  for  vowel  identification. 

SUMMARY 

In  summary,  we  developed  an  algorithm  for  the  extraction  of 
visual  objects  from  speech  spectrograms.  Results  from  a  limited 
vowel  recognition  experiment  suggest  that  the  processing  tech¬ 
nique  retains  acoustic  information  that  is  useful  for  phonetic 
distinction. 

In  the  future,  we  plan  to  evaluate  this  system  more  exten¬ 
sively,  and  to  investigate  the  feasibility  of  using  the  objects  for 
phonetic  recognition. 

REFERENCES 

[1] Canny, J.F., “Finding Edges and Lines in Images,” MIT-TR-720,
MIT.

[2] Cole, R.A., Rudnicky, A.I., Zue, V.W., and Reddy, D.R., “Speech
as Patterns on Paper,” in Perception and Production of Fluent
Speech, R.A. Cole, ed., Hillsdale, NJ: Lawrence Erlbaum Assoc.,
1980, pp. 3-50.

[3] Cole, R.A. and Zue, V.W., “Speech as Eyes See It,” in Attention
and Performance VIII, R.S. Nickerson, ed., Hillsdale, NJ:
Lawrence Erlbaum Assoc., 1980, pp. 475-494.

[4] Itakura, F., “Minimum Prediction Residual Principle Applied
to Speech Recognition,” IEEE Trans. Acoust., Speech, Signal
Process., vol. ASSP-23, no. 1, pp. 67-72, Feb. 1975.

[5] Leung, H.C. and Zue, V.W., “A Procedure for Automatic Alignment
of Phonetic Transcriptions with Continuous Speech,” IEEE
Conference Proceedings, ICASSP, San Diego, CA, 1984, paper
29.

[6] Marr, D., Vision, W.H. Freeman & Co., San Francisco, 1982.

[7] Witkin, A.P., “Scale-Space Filtering,” Proceedings of the International
Joint Conference on Artificial Intelligence, pp. 1019-1022,
1983.

[8] Yuille, A.L. and Poggio, T., “Scaling Theorems for Zero-crossings,”
AI Memo 722, MIT.

[9] Zue, V.W. and Cole, R.A., “Experiments on Spectrogram Reading,”
IEEE Conference Proceedings, ICASSP, Washington D.C.,
1979, pp. 116-119.

[10] Zue, V.W. and Lamel, L.F., “An Expert Spectrogram Reader:
A Knowledge-Based Approach to Speech Recognition,” IEEE
Conference Proceedings, ICASSP, Tokyo, Japan, 1986, paper
23.2.




From the Proceedings of ICASSP 86, the IEEE-IECEJ-ASJ International Conference on
Acoustics, Speech, and Signal Processing, held in Tokyo, Japan, April 8-11, 1986.


AN  EXPERT  SPECTROGRAM  READER: 

A  KNOWLEDGE-BASED 
APPROACH  TO  SPEECH  RECOGNITION* 

Victor  W.  Zue  and  Lori  F.  Lamel 

Department  of  Electrical  Engineering  and  Computer  Science,  and 
Research  Laboratory  of  Electronics 
Massachusetts  Institute  of  Technology 
Cambridge,  Massachusetts  02139 


ABSTRACT 

Human experts can determine the phonetic identity of unknown
utterances from a visual examination of the spectrogram
with performance better than available computer systems. The
spectrogram-reading process involves the use of multiple sources
of knowledge, including articulatory movements, acoustic phonetics,
phonotactics, and linguistics. In addition, the experts’
performance can be attributed to their ability to deal with partial
and/or conflicting information, as well as multiple cues.

This  paper  investigates  the  feasibility  of  constructing  a  know¬ 
ledge-based  system  that  mimics  the  process  of  spectrogram  read¬ 
ing  by  humans.  In  a  task  of  identifying  stop  consonants  ex¬ 
tracted  from  continuous  speech,  the  system  achieved  perfor¬ 
mance  that  is  comparable  to  that  of  the  experts. 

INTRODUCTION 

Over the past four decades the spectrogram, a three-dimensional
time-frequency-intensity representation of the signal, has
been the single most widely used form of display for speech. Part
of its popularity stems from the fact that it is relatively easy to
produce, and it provides a visual display of the relevant temporal
and spectral characteristics of the acoustic signal. It has been
an invaluable tool in the development of our understanding of
the acoustic properties of speech sounds.

Recently, a series of experiments by Zue and his colleagues
demonstrated that the underlying phonetic representation of an
unknown utterance can be recovered almost entirely from a visual
examination of the speech spectrogram [1], [2], [3]. In their
experiments, a trained spectrogram reader correctly identified
the phonetic segments with 80% to 90% accuracy, depending on
the experimental conditions and the scoring procedures.

While the spectrogram-reading experiments were intended
to illustrate the richness of phonetic information in the speech
signal, the results are relevant to automatic speech recognition
in several respects. First, they demonstrate that a great deal
of phonetic information can be derived from the acoustic signal
alone. The reader’s performance, measured in terms of accuracy
and rank-order statistics, was considerably better than that of
the phonetic front-ends of available speech recognition systems.
The experiments thus provide an “existence proof” that high-performance
phonetic recognition is attainable. Second, spectrogram
reading is based on the recognition and integration of a
myriad of acoustic cues. Some of these cues are relatively easy
to identify, while others are not meaningful until the relevant
context has been established. One must selectively attend to
many different acoustic cues, interpret their significance in light
of other evidence, and make inferences based on information
from multiple sources. The discovery of the acoustic cues and,
more importantly, of the control strategies for utilizing these
cues are the keys to high-performance phonetic recognition. Finally,
protocol analysis of the process of spectrogram reading
reveals that the decoding process often involves the use of explicit
rules. Thus the knowledge used in spectrogram reading is
potentially transferable to others, both humans and machines.

Our  experience  with  spectrogram  reading  suggests  that  the 
reasoning  process  can  be  naturally  expressed  as  a  series  of  pro¬ 
duction  (or  if-then)  rules,  where  the  preconditions  and  conclu¬ 
sions  may  be  phonetic  features  or  acoustic  events.  Since  the 
acoustic-phonetic  encoding  is  highly  context-dependent  and  re¬ 
dundant,  we  must  be  able  to  entertain  multiple  hypotheses  and 
to  check  for  consistency.  Acoustic  features  are  often  expressed 
in  a  qualitative  manner  and  described  as  being  present/absent, 
and  having  values  such  as  high/mid/low,  or  weak/strong.  Thus 
in  order  to  have  the  computer  mimic  the  performance  of  spectro¬ 
gram  readers,  we  need  a  system  that  can  deal  with  qualitative 
measures  in  a  meaningful  way. 

In  this  paper,  we  report  preliminary  results  of  our  attempt 
to incorporate our knowledge about the spectrogram-reading
process  in  a  knowledge-based  system  that  mimics  the  process 
of  feature  identification  and  logical  deduction  used  by  experts. 
The  knowledge  base  explicitly  represents  the  expert's  knowledge 
in  a  way  that  is  easy  to  understand,  modify,  and  update.  Our 
research  direction  is  very  similar  to  the  efforts  by  Johanssen  et 
al. [4] and Johnson et al. [5].

TASK  DEFINITION 

The  process  of  spectrogram  reading  involves  extracting  rele¬ 
vant  acoustic  features  and  combining  these  features  using  rules 
that  relate  the  underlying  phonetic  forms  to  their  acoustic  man¬ 
ifestations.  Our  task  investigates  the  feasibility  of  developing  a 
computer  system  that  mimics  such  a  process. 

In  order  to  keep  the  project  manageable,  we  made  some  im¬ 
portant  design  restrictions.  First,  we  decided  to  focus  on  the  ac¬ 
quisition  and  formalisation  of  the  knowledge  base,  rather  than 
the development of an expert system itself. As a result, our initial
effort makes use of an available Mycin-based [6], backward-chaining¹
system. Our investigation thus far has revealed that
this particular expert system may not be the most appropriate.
Nevertheless, it has provided us with a convenient mechanism
to  acquire  and  formalise  our  knowledge,  while  freeing  us  from 
the  need  to  delve  into  a  very  difficult  research  area. 

Second,  we  bypass  the  problem  of  automatic  extraction  of 
acoustic  features.  Many  of  the  acoustic  features  used  during 
spectrogram  reading  are  readily  extracted  by  the  human  visual 
system,  but  are  very  difficult  to  extract  automatically  by  com¬ 
puter.  For  example,  there  does  not  yet  exist  a  formant  tracker 
that  can  determine  formant  frequencies  reliably,  especially  in  re¬ 
gions  where  the  direction  and  the  extent  of  formant  transitions 
provide  important  information  about  the  place  of  articulation 
for  consonants.  Thus,  while  the  measurements  were  made  auto¬ 
matically  whenever  possible,  the  acoustic  features  were  verified 
by  the  experimenter  before  being  entered  into  the  database. 
Recent work by Leung and Zue [7] attempts to locate two-dimensional
objects directly from the spectrogram. Their work
on  visual  object  recognition  may  eventually  play  a  role  in  the 
feature  extraction  part  of  our  system. 

Finally,  we  selected  the  task  of  identifying  stop  consonants 
both  as  singletons  and  in  clusters,  since  the  cues  for  stop  con¬ 
sonants  are  complex,  interrelated,  and  easily  modified  by  pho¬ 
netic  context.  This  paper  reports  on  the  identification  of  word- 
initial  singleton  stop  consonants  that  appear  between  two  vow¬ 
els.  Stops  have  been  extensively  studied  and  recognition  results 
are  available  for  comparison. 


SYSTEM  DESCRIPTION 

The  development  of  our  knowledge-based  system  for  spec¬ 
trogram  reading  is  divided  into  two  parts.  First  we  select  a 
set  of  acoustic  features  that  are  important  for  phonetic  decod¬ 
ing,  and  outline  the  procedures  for  their  extraction.  Then  we 
develop  rules  that  operate  on  these  acoustic  features  to  deduce 
the  underlying  phonetic  form.  This  latter  task  involves  both  the 
formalisation  of  our  knowledge  with  respect  to  the  terminology 
and descriptions, and the actual statements of the acoustic-to-phonetic
mapping. These two aspects of the system are described
next.

Making the Measurements

Feature  Selection  The  acoustic  features  useful  for  spec¬ 
ifying  a  given  phonetic  contrast  were  initially  determined  by 
combing  the  acoustic-phonetic  literature  and  by  observing  spec¬ 
trogram  reading  sessions  conducted  by  experts.  Next,  several 
hundred  spectrograms  containing  stop  consonants  were  anno¬ 
tated  by  experts  and  studied  to  verify  the  usefulness  of  these 
cues  and  to  suggest  supplementary  measurements.  For  our  cur¬ 
rent  task  of  stop  identification,  we  obtained  acoustic  features 
that  describe  the  release  burst,  the  closure  interval,  and  the 
surrounding contexts. These features include the voice onset
time  (VOT),  the  location  and  the  strength  of  the  burst,  and 
the  formant  transitions  preceding  closure  and  following  release. 
Our system currently utilizes 26 acoustic features.

Feature  Extraction  As  stated  earlier,  at  this  moment  we 
are not concerned with the automatic extraction of the acoustic
features. Instead, we assume that the measurements of the
acoustic features are made with no error, and as a result, system
performance can be assessed in relation to the adequacy of the
acoustic features, the rules, and the control strategy.

¹ Many search problems can be treated as finding a path to a goal state
from some initial position. When the search proceeds from the initial
state toward the goal state, it is said to be a forward chaining system. In
contrast, when the search starts at the goal state and works back toward
the initial state, then it is said to be backward chaining.

Figure 1: A display of the interactive measurement system.

Some  of  the  acoustic  features  can  be  measured  reliably  with¬ 
out  human  intervention.  For  example,  the  system  can  automat¬ 
ically  determine  whether  the  following  vowel  is  rounded  from 
the  phonetic  transcription.  Some  other  measurements,  such  as 
whether  the  stop  release  is  pencil-thin,  are  qualitative  in  nature 
and  must  be  provided  by  the  expert.  Most  of  the  measurements, 
however,  can  be  made  automatically,  subject  to  verification  by 
the  expert.  For  example,  although  the  time  location  of  the 
burst is a measurement first made by the computer, verification
is  necessary  partly  because  the  measurement  is  inherently  er¬ 
ror  prone,  and  partly  because  other  measurements  depend  on 
accurate  burst  location. 

To  facilitate  the  measurement  of  the  acoustic  features  by 
hand,  we  have  developed  a  semi-automatic  system  that  makes 
many  of  the  measurements  automatically  based  on  a  time-aligned 
phonetic transcription [8]. In making measurements, the expert
has  available  displays  of  the  spectrogram,  the  speech  waveform, 
the  short-time  spectra,  and  energies  in  selected  frequency  bands. 
The system goes through a checklist of acoustic features, making
the measurements and querying the expert to verify or modify
them. An example of the display used by the expert to make the
measurements is shown in Figure 1. In this example, the system
determined  the  first  three  formants  at  the  onset  of  the  following 
vowel  without  error.  The  formant  frequencies  are  marked  by 
a  short  vertical  line,  with  associated  numerical  values,  in  the 
short-time spectrum window at the upper right-hand corner of
the  display. 
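
The checklist procedure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_checklist`, `measure`, and `verify` are hypothetical names, and the feature names and values in the usage lines are invented rather than taken from the system.

```python
def run_checklist(checklist, measure, verify):
    """Measure each feature automatically, then let the expert confirm or
    override the proposed value. Both callbacks are supplied by the caller."""
    verified = {}
    for feature in checklist:
        proposed = measure(feature)                    # automatic first pass
        verified[feature] = verify(feature, proposed)  # expert has the final say
    return verified


# Usage with stand-in callbacks (all values are invented for illustration).
auto = {"burst_time_ms": 412, "vot_ms": 55}
corrections = {"burst_time_ms": 418}  # the expert moves the burst marker
result = run_checklist(
    ["burst_time_ms", "vot_ms"],
    measure=auto.get,
    verify=lambda name, value: corrections.get(name, value),
)
```

Keeping the expert's answer as the stored value reflects the assumption stated earlier that the measurements entering the database are treated as error-free.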

Each  sample  in  the  database  has  an  associated  list  of  feature 
values that are mostly numerical. These values are used to
develop rules and to test the knowledge-based system.

Formalising  the  Knowledge 

Not  much  is  known  about  how  experts  approach  the  spectro¬ 
gram-reading  problem.  The  general  strategy  of  expert  spectro¬ 
gram  readers  is  to  make  some  preliminary  proposal  separating 
the  segments  into  broad  phonetic  classes.  The  candidate  set 
is then refined by incorporating detailed acoustic cues to rule
out unlikely hypotheses. In our attempt to capture this complicated
problem-solving procedure, we employ several general
principles. First, multiple hypotheses based on diverse acoustic
evidence must be entertained. Second, the presence of a cue
may be useful, but its absence need not be harmful. Third, very
strong evidence of one kind may preclude competing hypotheses.
An example utilizing these principles is shown in Figure
2. The place of articulation of the stop consonant in the right-hand
panel can be readily identified as VELAR by the compact,
low-frequency burst. No other information is necessary. On the
other hand, the bursts for the other two stops are very similar;
both are rich in high-frequency energy. Only after the vowel context
is known can one infer that the first stop is ALVEOLAR
(in a rounded environment) and the second stop is VELAR (in
a fronted environment).

Figure 2: Spectrograms of /t/ and /k/ preceding different vowels.

In our system, phonemes are represented as a bundle of distinctive
features [9]. Thus, for example, the stop /t/ has the
features:  STOP,  VOICELESS,  ALVEOLAR.  A  stop  is  identi¬ 
fied  when  there  is  strong  evidence  for  the  presence  of  its  distinc¬ 
tive  features.  Our  system  uses  three  stages  to  identify  stops. 
First, the phonemes are mapped into a set of distinctive features.
Next, the numerical values of the acoustic features are
mapped into a set of qualitative descriptions, such as high/low
and strong/weak. Finally, a set of relatively independent rules
deduces each distinctive feature from the qualitative descriptions.

Structure  of  the  Rules  There  are  several  types  of  rules  in 
our  system,  each  dealing  with  a  particular  transformation  of  the 
data.  First,  there  are  rules  that  define  the  relationship  between 
a  phoneme  and  its  distinctive  feature  values.  For  example,  the 
stop /t/ is defined by the following rule:

If the voicing of the stop is voiceless,
and the place of articulation of the stop is alveolar,
then the identity of the stop is /t/.

All  of  the  stops  are  defined  in  the  same  manner.  Thus  we 
have  converted  the  problem  of  deducing  the  identity  of  a  stop 
to  one  of  determining  its  voicing  and  place  characteristics. 
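
The reduction from stop identity to voicing and place can be pictured as a lookup over feature bundles. The fragment below is a minimal sketch and not the system's actual rule base: the table lists the six English stops with their standard phonetic classification, and the names `STOP_DEFINITIONS` and `identify_stop` are hypothetical.

```python
# Hypothetical sketch: each stop is defined by its (voicing, place) bundle,
# in the same spirit as the /t/ rule. Names are illustrative only.
STOP_DEFINITIONS = {
    ("voiceless", "labial"): "p",
    ("voiceless", "alveolar"): "t",
    ("voiceless", "velar"): "k",
    ("voiced", "labial"): "b",
    ("voiced", "alveolar"): "d",
    ("voiced", "velar"): "g",
}


def identify_stop(voicing, place):
    """Return the stop whose defining features match, or None if no rule applies."""
    return STOP_DEFINITIONS.get((voicing, place))
```

With this table, `identify_stop("voiceless", "alveolar")` yields `"t"`, so the remaining work lies entirely in deciding the two feature values.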

When  experts  read  spectrograms  they  use  their  visual  system 
to  extract  features  in  the  image.  Then,  using  a  wealth  of  knowl¬ 
edge,  they  combine  these  features  to  form  phonetic  hypotheses. 
Experts  use  qualitative  descriptions,  such  as  a  second  formant 
that  is  low,  mid,  or  high,  but  rarely  specify  numeric  values.  Al¬ 
though  they  have  an  intuitive  sense  of  what  these  terms  mean, 
experts  may  have  difficulty  quantifying  them  reliably. 

In  order  to  simulate  this  process,  a  set  of  rules  has  been 
developed  to  map  the  numerical  values  of  the  acoustic  measure¬ 
ments  into  qualitative  descriptions.  The  mapping  ranges  have 
all  been  hand-selected  from  histograms.  Generally  the  qualita¬ 
tive  descriptions  are  associated  with  disjoint  numerical  regions. 
Measurements  that  fall  between  regions  are  associated  with  both 
labels,  each  with  a  lower  confidence  factor. 
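
The mapping just described can be sketched as follows. This is an illustration only: the function name `describe`, the label names, the thresholds, and the reduced confidence of 0.5 are all assumptions, since the hand-selected ranges themselves are not reproduced here.

```python
def describe(value, low_max, high_min, labels=("short", "long"), overlap_cf=0.5):
    """Map a numerical measurement to qualitative labels with confidence factors.

    Values at or below low_max receive the first label, values at or above
    high_min receive the second. A value falling between the two regions
    receives both labels, each with the lower confidence overlap_cf.
    """
    if value <= low_max:
        return {labels[0]: 1.0}
    if value >= high_min:
        return {labels[1]: 1.0}
    return {labels[0]: overlap_cf, labels[1]: overlap_cf}


# Hypothetical VOT regions in milliseconds (not values from the system).
clear_case = describe(15, low_max=25, high_min=40)   # inside the "short" region
border_case = describe(30, low_max=25, high_min=40)  # between the two regions
```

Returning both labels for a borderline value lets later rules for either description fire, which is one way to keep multiple hypotheses alive.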


The  last  set  of  rules  deduces  the  distinctive  features  from 
the  acoustic  descriptions.  Two  examples  of  rules  that  deduce 
the feature VOICING are shown below:

If the VOT is short,
and the following vowel is not a schwa,
then the stop is voiced.

If  there  is  prevoicing  during  closure, 
then  the  stop  is  voiced. 


The  second  example  reflects  the  asymmetry  of  some  of  the 
acoustic cues; in this case the presence of prevoicing is a good indicator
for a voiced stop whereas the absence of this cue does not
necessarily  rule  out  a  voiced  stop.  Note  also  that  the  strength 
of  a  rule's  conclusions  depends  upon  the  belief  in  the  precon¬ 
ditions.  If  one  is  uncertain  about  the  acoustic  measurements, 
multiple rules can be fired, each with a lower confidence factor.
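
As a rough illustration of how confidence might propagate through the two voicing rules, the sketch below uses the combination formulas commonly associated with Mycin-style certainty factors: a conjunction takes the minimum of its preconditions, and two positive factors a and b supporting the same conclusion combine as a + b(1 - a). Whether our system applies exactly these formulas is not stated in the text; the rule strengths 0.7 and 0.9 and all function and key names are invented for the example.

```python
def rule_cf(rule_strength, precondition_cfs):
    """Conclusion confidence: rule strength scaled by the weakest precondition."""
    return rule_strength * min(precondition_cfs)


def combine(cf_a, cf_b):
    """Merge two positive confidence factors that support the same conclusion."""
    return cf_a + cf_b * (1.0 - cf_a)


def voiced_evidence(facts):
    """Apply the two voicing rules to a dict of precondition confidences."""
    total = 0.0
    # Rule 1: the VOT is short and the following vowel is not a schwa.
    vot = facts.get("vot_short", 0.0)
    vowel = facts.get("vowel_not_schwa", 0.0)
    if vot > 0 and vowel > 0:
        total = combine(total, rule_cf(0.7, [vot, vowel]))  # 0.7 is hypothetical
    # Rule 2: prevoicing during closure. When prevoicing is absent the rule
    # simply does not fire, so nothing counts against a voiced stop.
    prevoicing = facts.get("prevoicing", 0.0)
    if prevoicing > 0:
        total = combine(total, rule_cf(0.9, [prevoicing]))  # 0.9 is hypothetical
    return total
```

An uncertain precondition weakens the conclusion in proportion: with `vot_short` at 0.5 and no prevoicing observed, the evidence for "voiced" is 0.7 times 0.5, or 0.35.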

Control  Strategy  Mycin  uses  a  very  simple  goal-directed 
control  strategy.  It  sets  off  to  determine  the  identity  of  the  stop, 
and  in  the  process  needs  to  deduce  its  voicing  and  place  charac¬ 
teristics.  In  each  case,  the  system  will  exhaustively  fire  all  the 
pertinent  rules.  We  are  able  to  affect  the  control  strategy  some¬ 
what  by  including  preconditions  that  inhibit  certain  rules  from 
firing.  For  example,  if  the  stop  release  is  very  weak,  one  should 
not  pay  attention  to  the  frequency  location  of  the  burst,  as  it 
will  be  unreliable.  As  another  example,  the  formant  transitions 
for  voiced  stops  are  measured  after  voicing  onset.  However,  for 
voiceless  stops,  the  same  measurements  are  made  during  aspi¬ 
ration,  since  the  transitions  are  already  completed  by  voicing 
onset. 
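
The goal-directed strategy, including a precondition that inhibits burst-location rules when the release is weak, can be sketched with a very small backward chainer. This is a simplified illustration and not the Mycin-based system itself: it is purely boolean (no confidence factors), and every rule, fact name, and function name below is an assumption made for the example.

```python
RULES = [
    # (conclusion, preconditions); all names are illustrative.
    (("identity", "t"), [("voicing", "voiceless"), ("place", "alveolar")]),
    (("identity", "k"), [("voicing", "voiceless"), ("place", "velar")]),
    (("voicing", "voiceless"), [("vot", "long")]),
    # The "not-weak" release precondition inhibits burst-location rules.
    (("place", "alveolar"), [("release", "not-weak"), ("burst", "high")]),
    (("place", "velar"), [("release", "not-weak"), ("burst", "compact-low")]),
]


def prove(goal, facts):
    """Backward chaining: a goal holds if it is an observed fact, or if every
    precondition of some rule concluding it can itself be proved."""
    if goal in facts:
        return True
    return any(
        all(prove(pre, facts) for pre in preconditions)
        for conclusion, preconditions in RULES
        if conclusion == goal
    )


def identify(facts):
    """Start from each candidate identity and work back toward the facts."""
    candidates = sorted({c[1] for c, _ in RULES if c[0] == "identity"})
    return [stop for stop in candidates if prove(("identity", stop), facts)]
```

Given the facts `{("vot", "long"), ("release", "not-weak"), ("burst", "high")}`, `identify` returns `["t"]`; replacing the release fact with `("release", "weak")` blocks both place rules and the result is empty.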

EXPERIMENTAL  RESULTS 

To  test  the  effectiveness  of  our  system  we  performed  a  stop 
identification  experiment  in  which  the  stops  are  known  to  be 
word-initial  and  to  appear  between  two  vowels.  We  greatly  re¬ 
duced  the  complexity  of  the  problem  by  restricting  our  infor¬ 
mation  to  the  segment  to  be  identified  and  its  immediate  neigh¬ 
bors.  In  making  the  measurements,  the  system  was  provided 
with  knowledge  of  the  vowel  contexts  and  with  time  points  that 
roughly  correspond  to  the  points  of  closure,  release,  and  voic¬ 
ing  onset.  Refined  time-points  and  other  measurements  were 
determined  using  the  interactive  system  described  earlier. 

Data  Description 

Two  hundred  intervocalic  stops  were  randomly  selected  from 
a  database  of  1,000  sentences  spoken  by  100  speakers,  50  male 
and  50  female.  One  hundred  tokens  were  used  for  system  train¬ 
ing,  and  100  for  system  testing.  The  stops  for  the  training  and 
test  sets  were  obtained  from  64  speakers;  45  appear  in  both  data 
sets. There was no restriction on the vowels; in fact, some of the
stops preceded a schwa. In order to compare the system's performance
to human performance, spectrograms of the training
and  testing  samples  were  read  by  five  experts. 

System  training  involves  selecting  the  acoustic  features,  set¬ 
ting  the  thresholds  for  the  mapping  functions,  and  formulating 
the rules. Rule development is an iterative process; an initial set
of rules is proposed and tested on a subset of training samples.
By examining the output of the system, the experimenter refines
the rules and tests them on other training samples. The process
continues until the system behavior is judged to be satisfactory.

condition             first choice     top 2 choice
                      accuracy (%)     accuracy (%)

training  human(2)         90               92
          system           88               95

testing   human(3)         92               96
          system           84               92

Table 1: Comparison of human and system identification performance.


Performance Evaluation and Discussion

Table  1  summarizes  the  results  of  our  experiments.  For  the 
training  data,  the  system's  performance  is  comparable  to  that 
of  the  experts.  The  performance  of  the  system  degraded  by  4% 
when  it  was  confronted  with  new  data,  whereas  the  experts' 
performance  on  the  test  data  remains  high.  We  attribute  the 
degradation  of  performance  from  training  to  test  data  primarily 
to the "lack of experience" of the system; it has not yet learned
all  the  acoustic  features  and  rules  used  by  the  experts.  Most  of 
the  errors  are  not  due  to  new  speakers,  and  there  is  no  obvious 
male/female  bias. 
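
For reference, first-choice and top-2 accuracy of the kind reported in Table 1 can be computed from ranked candidate lists as sketched below. The function name `rank_accuracy` and the toy data are assumptions for illustration; they are not the scoring code or the data used in the experiment.

```python
def rank_accuracy(ranked_outputs, truths, top_n=1):
    """Percentage of tokens whose true label is among the first top_n candidates."""
    hits = sum(truth in ranked[:top_n] for ranked, truth in zip(ranked_outputs, truths))
    return 100.0 * hits / len(truths)


# Toy example with four tokens (invented, not experimental data).
ranked = [["t", "k"], ["k", "t"], ["p", "k"], ["d", "g"]]
truths = ["t", "t", "k", "b"]
first_choice = rank_accuracy(ranked, truths, top_n=1)
top_two = rank_accuracy(ranked, truths, top_n=2)
```

In the toy data only the first token is correct as a first choice, while three of the four are correct within the top two, which shows why the top-2 figure can never be lower than the first-choice figure.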

Table  2  displays  the  confusion  matrix  on  the  system’s  first 
choice  identification  for  the  test  data.  All  but  one  of  the  errors 
are  in  identifying  the  place  of  articulation.  Ten  of  the  16  er¬ 
rors  involve  the  VELAR  place  of  articulation.  Examination  of 
the  spectrograms  reveals  that  most  of  the  errors  made  by  the 
system  are  judged  to  be  reasonable  by  experts.  For  example, 
/t/-/k/ confusion usually occurs when the /t/ is rounded, /k/-/t/
confusion when the /k/ is fronted, and /k/-/p/ confusion
when  the  /k/  is  back  and  has  a  weak  release. 
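
An error analysis of this kind can be sketched by tallying (true, identified) pairs and classifying each error by the feature that differs. The code below is only an illustration: `FEATURES` uses the standard classification of the six stops, while `error_breakdown` and the toy pairs are hypothetical and do not reproduce Table 2.

```python
from collections import Counter

FEATURES = {  # stop -> (voicing, place)
    "p": ("voiceless", "labial"), "t": ("voiceless", "alveolar"),
    "k": ("voiceless", "velar"), "b": ("voiced", "labial"),
    "d": ("voiced", "alveolar"), "g": ("voiced", "velar"),
}


def error_breakdown(pairs):
    """Return a confusion count over (truth, guess) pairs and a tally of error types."""
    confusion = Counter(pairs)
    kinds = Counter()
    for truth, guess in pairs:
        if truth == guess:
            continue
        (v_true, p_true), (v_guess, p_guess) = FEATURES[truth], FEATURES[guess]
        if v_true != v_guess:
            kinds["voicing"] += 1
        if p_true != p_guess:
            kinds["place"] += 1
            if "velar" in (p_true, p_guess):
                kinds["involves_velar"] += 1
    return confusion, kinds


# Toy pairs (invented): one correct token, three place errors, one voicing error.
confusion, kinds = error_breakdown(
    [("t", "t"), ("t", "k"), ("k", "t"), ("k", "p"), ("b", "p")]
)
```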

We  are  encouraged  by  the  initial  performance  results  of  our 
system.  Although  the  system  did  not  perform  as  well  as  hu¬ 
man  experts,  our  results  are  comparable  to  stop  recognition 
results  reported  in  the  literature  on  similar  tasks.  While  stops 
have  been  extensively  studied,  most  recognition  experiments  re¬ 
ported  have  been  on  word-initial  stops  in  isolated  words  and/or 
pre-stressed position. The recognition task closest to our own
was reported by Demichelis et al. [10]. Using acoustic features
that were combined with fuzzy logic and rules, they achieved
recognition rates of 90-92% for stops in continuous speech.

SUMMARY 

We  believe  that  we  are  making  headway  in  our  efforts  to  cap¬ 
ture  the  knowledge  used  by  experts  in  the  spectrogram-reading 
task,  and  to  encode  that  knowledge  into  features  and  rules. 
While  the  rule  set  is  still  incomplete,  we  feel  that  the  rules 
express our knowledge succinctly. As stated earlier, rule development
is an iterative and interactive process. Each iteration
improves  our  knowledge  and  understanding,  which  is  then  re¬ 
flected  in  the  system  design  and  performance.  As  more  and 
more  data  is  used  for  training,  statistical  techniques  can  be  em¬ 
ployed  to  arrive  at  a  more  accurate  measurement-to-description 
mapping. 

While  the  performance  of  the  system  can  be  improved,  the 
current  implementation  does  not  accurately  model  the  problem¬ 
solving  procedure  used  by  human  experts.  This  is  partly  due  to 
limitations imposed by the structure of the Mycin-based expert
system that we are using. The goal-directed, backward-chaining
inferencing of Mycin does not enable the system to evaluate multiple
hypotheses at any given time. As a practical matter this
makes  the  system  harder  to  use  and  debug.  In  contrast,  ex¬ 
perts  tend  to  do  forward  induction,  and  to  keep  a  set  of  possible 
candidates.  In  the  future,  we  plan  to  implement  our  rules  in 
a  forward  chaining  system  that  better  models  expert  behavior. 
We  also  intend  to  evaluate  the  system  more  extensively,  and  to 
increase  the  complexity  of  the  task  by  extending  the  recognition 
to include impostors and stops in clusters.

Utilizing Speech-Specific Knowledge in Automatic Speech Recognition¹

Victor  W.  Zue 

Department  of  Electrical  Engineering  and  Computer  Science 
Massachusetts  Institute  of  Technology 
Cambridge,  MA  02139  USA 

In  automatic  speech  recognition,  the  acoustic  signal  is  the  only  tangible  con¬ 
nection  between  the  talker  and  the  machine.  While  the  signal  conveys  linguistic 
information,  it  also  contains  extralinguistic  information  about  such  matters  as  the 
identity  of  the  speaker,  his  or  her  physiological  and  psychological  states,  and  the 
acoustic environment. I believe that successful speech recognition is possible only
if  we  can  determine  ways  to  extract  the  linguistic  information  while  discarding  ir¬ 
relevant  information. 

Over  the  past  three  decades,  we  have  made  slow  but  steady  progress  in  research¬ 
ing  the  complex  relationship  between  the  underlying  linguistic  representations  of  an 
utterance and its various acoustic realizations. While decades may pass before we
reach  a  full  understanding,  we  may  still  derive  near-term  benefits  from  the  increased 
utilization  of  speech  knowledge  in  speech  recognition  algorithms.  The  benefits  can 
take  the  form  of  better  algorithm  performance  or  reduced  sensitivity  of  systems  to 
variations  in  speaker  and  environment. 

In  my  presentation,  I  will  suggest  the  following: 

• Signal representation based on the human auditory system may be important in
enhancing  phonetic  contrasts. 

•  Performance  of  pattern  recognition  algorithms  may  be  improved  when  aug¬ 
mented  with  speech  knowledge. 

•  New  models  of  speech  recognition  utilizing  constraints  imposed  by  the  lan¬ 
guage  may  be  effective. 

•  Optimum  utilization  of  incomplete  acoustic-phonetic  knowledge  in  the  form 
of  ignorance  modeling  may  be  important. 


¹ Research supported by DARPA contract N00014-82-K-0727, as monitored by the Office of Naval
Research. 